## Air Quality Index Classification Using Azure Machine Learning

### Azure SDK Installation
This section installs `azure-ai-ml`, the Azure Machine Learning Python SDK v2, together with `azure-identity`, which provides the credential classes used in the next cell.

In [1]:
%pip install azure-ai-ml azure-identity

Note: you may need to restart the kernel to use updated packages.


### Azure Credential Handling
Here, the notebook uses `DefaultAzureCredential` to handle authentication, with a fallback to `InteractiveBrowserCredential` in case the default method fails.

In [2]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient

try:
    credential = DefaultAzureCredential()
    # Check that the credential can actually obtain a token.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()


### Workspace Setup
This section connects to the Azure Machine Learning workspace using `MLClient.from_config()`, which reads workspace configuration from a `config.json` file.

In [3]:
# Get a handle to workspace
ml_client = MLClient.from_config(credential=credential)

Found the config file in: /config.json


### Custom Environment Setup
A custom environment is defined in an `environment.yml` conda file. The file pins Python 3.8 and scikit-learn 0.24 and installs `pandas`, `numpy`, `catboost`, `lightgbm`, and `xgboost` from conda. The pip section adds `mlflow`, `imbalanced-learn` (needed for SMOTE during training), the plotting libraries `matplotlib` and `seaborn`, and `azureml-sdk`.

In [5]:
%%writefile environment.yml
name: custom-azureml-sklearn-env
channels:
  - defaults
  - conda-forge
dependencies:
  - python=3.8
  - pandas
  - numpy
  - scikit-learn=0.24
  - catboost
  - lightgbm
  - xgboost
  - pip
  - pip:
      - mlflow
      - imbalanced-learn  # provides SMOTE (imported as imblearn)
      - matplotlib
      - seaborn
      - azureml-sdk  # v1 SDK; the training scripts use azureml.core.run.Run for logging

Overwriting environment.yml


In [6]:
from azure.ai.ml.entities import Environment

# Load environment from the conda YAML file
env = Environment(
    name="custom-azureml-sklearn-env",
    conda_file="environment.yml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest"
)

# Register the environment in Azure ML workspace
ml_client.environments.create_or_update(env)


Environment({'arm_type': 'environment_version', 'latest_version': None, 'image': 'mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest', 'intellectual_property': None, 'is_anonymous': False, 'auto_increment_version': False, 'auto_delete_setting': None, 'name': 'custom-azureml-sklearn-env', 'description': None, 'tags': {}, 'properties': {'azureml.labels': 'latest'}, 'print_as_yaml': False, 'id': '/subscriptions/edc61183-6780-4b6c-8c4a-a5fb6ae6b788/resourceGroups/raghavavigneshwar/providers/Microsoft.MachineLearningServices/workspaces/raghava/environments/custom-azureml-sklearn-env/versions/5', 'Resource__source_path': '', 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/raghava/code/Users/User1-44320065', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f0a4c4d7a00>, 'serialize': <msrest.serialization.Serializer object at 0x7f0a4c4e2dc0>, 'version': '5', 'conda_file': {'channels': ['defaults', 'conda-forge'], 'dependencies': ['python=3.8',
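
Once the environment is built, a quick import check inside it can confirm that the packages resolved. The sketch below is a generic standard-library check, not part of the Azure ML SDK; `missing_modules` is a hypothetical helper, and the list uses import names (`sklearn`, `imblearn`) rather than the package names from `environment.yml`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the packages listed in environment.yml
# (scikit-learn is imported as sklearn, imbalanced-learn as imblearn).
required = ["pandas", "numpy", "sklearn", "imblearn", "xgboost", "mlflow"]
print("Missing modules:", missing_modules(required))
```

An empty list means every module was found; any name that is printed points at a package that did not install into the environment.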

### Python Scripts and JSON Files
In this project, Python scripts are used to define various components of the machine learning workflow, such as data preparation, model training, and evaluation. These scripts are often parameterized to make them reusable and flexible.
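
As a minimal sketch of what "parameterized" means here, the snippet below mirrors the `--input_data` / `--output_data` interface that the scripts in this notebook expose. Passing an explicit argument list to `parse_args` is an assumption added for illustration (the scripts below read from the command line instead); it makes the parser easy to try locally without submitting a job.

```python
import argparse


def parse_args(argv=None):
    """Parse component arguments; argv=None falls back to the real command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, required=True)
    parser.add_argument("--output_data", type=str, required=True)
    return parser.parse_args(argv)


# Simulate the arguments Azure ML would substitute into the component command.
args = parse_args(["--input_data", "city_day.csv", "--output_data", "outputs"])
print(args.input_data, args.output_data)
```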

The `config.json` file is used to store workspace configuration information, such as the subscription ID, resource group, and workspace name. It allows for seamless connection to the Azure Machine Learning workspace without manually specifying these details in every script.
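
For reference, the sketch below writes and reads back a file with the three keys that `config.json` carries. The values are placeholders, not real identifiers, and the file is written to a temporary directory only so that the example has no side effects; in practice it sits where `MLClient.from_config()` can find it.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values: substitute your own subscription, resource group, and workspace.
workspace_config = {
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
}

# Write the file to a throwaway directory for this demonstration.
config_path = Path(tempfile.mkdtemp()) / "config.json"
config_path.write_text(json.dumps(workspace_config, indent=2))

# Read it back to confirm the structure.
loaded = json.loads(config_path.read_text())
print(sorted(loaded))
```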

In [7]:
import os

# create a folder for the script files
script_folder = 'src'
os.makedirs(script_folder, exist_ok=True)
print(script_folder, 'folder created')

src folder created


In [8]:
%%writefile $script_folder/load_data.py
import argparse
import pandas as pd
from pathlib import Path

def main(args):
    df = get_data(args.input_data)
    df.to_csv((Path(args.output_data) / "loaded_data.csv"), index=False)

def get_data(path):
    df = pd.read_csv(path)
    row_count = len(df)
    print('Preparing {} rows of data'.format(row_count))
    return df

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", dest='input_data', type=str)
    parser.add_argument("--output_data", dest='output_data', type=str)
    args = parser.parse_args()
    return args

if __name__ == "__main__":
    args = parse_args()
    main(args)


Overwriting src/load_data.py


In [9]:
%%writefile $script_folder/clean_data.py
import argparse
import glob
from pathlib import Path

import pandas as pd

def main(args):
    df = get_data(args.input_data)
    cleaned_data = clean_data(df)
    cleaned_data.to_csv((Path(args.output_data) / "cleaned_data.csv"), index=False)
    
def get_data(data_path):
    all_files = glob.glob(data_path + "/*.csv")
    df = pd.concat((pd.read_csv(f) for f in all_files), sort=False)
    return df

def clean_data(df):
    df = df.dropna(subset=['AQI'])
    df.reset_index(drop=True, inplace=True)
    print("Removed rows with missing 'AQI' values. Remaining rows: {}".format(len(df)))
    return df

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", dest='input_data', type=str)
    parser.add_argument("--output_data", dest='output_data', type=str)
    args = parser.parse_args()
    return args

if __name__ == "__main__":
    args = parse_args()
    main(args)


Overwriting src/clean_data.py


In [10]:
%%writefile $script_folder/impute_data.py
import argparse
import glob
from pathlib import Path

import pandas as pd
from sklearn.impute import KNNImputer

def main(args):
    df = get_data(args.input_data)
    imputed_data = impute_missing_values(df)
    imputed_data.to_csv((Path(args.output_data) / "imputed_data.csv"), index=False)
    
def get_data(data_path):
    all_files = glob.glob(data_path + "/*.csv")
    df = pd.concat((pd.read_csv(f) for f in all_files), sort=False)
    return df

def impute_missing_values(df):
    missing_features = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene']
    for feature in missing_features:
        n_missing = df[feature].isnull().sum()
        # Skip columns with nothing to fill, and columns with no observed values to learn from
        if n_missing == 0 or n_missing == len(df):
            continue

        # KNNImputer is an unsupervised transformer and ignores any `y` passed to fit(),
        # so the column to fill must be part of X. With AQI always present (rows without
        # it were dropped in clean_data.py), neighbours are chosen by AQI distance and each
        # gap is filled with the mean of the 5 nearest observed values of the feature.
        imputer = KNNImputer(n_neighbors=5)
        imputed = imputer.fit_transform(df[['AQI', feature]])
        df[feature] = imputed[:, 1]

        print(f"Imputed missing values for feature: {feature}")
    return df

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", dest='input_data', type=str)
    parser.add_argument("--output_data", dest='output_data', type=str)
    args = parser.parse_args()
    return args

if __name__ == "__main__":
    args = parse_args()
    main(args)


Overwriting src/impute_data.py
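
To make the idea of AQI-based nearest-neighbour imputation concrete, here is a dependency-free sketch on toy data. It is not scikit-learn's `KNNImputer`; `knn_impute` and the sample values are hypothetical and only illustrate the principle: a missing pollutant reading is filled with the mean of the readings from the `k` rows whose AQI is closest.

```python
def knn_impute(aqi, feature, k=2):
    """Fill None entries in `feature` with the mean of the k donors nearest in AQI."""
    # Donors are rows where the feature was actually observed.
    donors = [(a, f) for a, f in zip(aqi, feature) if f is not None]
    if not donors:
        raise ValueError("no observed values to impute from")

    filled = []
    for a, f in zip(aqi, feature):
        if f is not None:
            filled.append(f)
            continue
        # Rank donors by absolute AQI distance and average the closest k.
        nearest = sorted(donors, key=lambda d: abs(d[0] - a))[:k]
        filled.append(sum(value for _, value in nearest) / len(nearest))
    return filled


aqi = [50, 60, 200, 210, 55]
pm25 = [20.0, 24.0, 110.0, None, None]
print(knn_impute(aqi, pm25, k=2))
```

The last row (AQI 55) sits between the two low-AQI donors and receives their average, while the row with AQI 210 is pulled towards the high reading at AQI 200.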


In [11]:
%%writefile $script_folder/handle_outliers.py
import argparse
import glob
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def main(args):
    df = get_data(args.input_data)
    visualize_outliers(df, args.output_data)
    print("Outlier visualization complete.")
    final_cleaned_data = replace_outliers(df)
    final_cleaned_data.to_csv((Path(args.output_data) / "final_cleaned_data.csv"), index=False)

def get_data(data_path):
    all_files = glob.glob(data_path + "/*.csv")
    df = pd.concat((pd.read_csv(f) for f in all_files), sort=False)
    return df

def detect_outliers_zscore(data, threshold=3):
    # Flag values whose z-score magnitude exceeds the threshold
    mean = np.mean(data)
    std = np.std(data)
    if std == 0:
        return []  # constant column: nothing to flag, and avoids dividing by zero
    z_scores = (data - mean) / std
    return data[np.abs(z_scores) > threshold].tolist()

def replace_outliers_with_max(data, outliers):
    # Cap outliers at the largest value that lies at or below the third quartile
    Q3 = np.percentile(data, 75)
    max_within_Q3 = np.max(data[data <= Q3])
    # Return a new Series rather than mutating a column slice in place,
    # which pandas does not guarantee to write back to the parent DataFrame
    return data.where(~data.isin(outliers), max_within_Q3)

def replace_outliers(df):
    numerical_features = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI']
    cleaned_df = df.copy()
    for feature in numerical_features:
        outliers = detect_outliers_zscore(cleaned_df[feature])
        if len(outliers) > 0:
            cleaned_df[feature] = replace_outliers_with_max(cleaned_df[feature], outliers)
    cleaned_df.reset_index(drop=True, inplace=True)
    print("Outliers have been detected and replaced.")
    return cleaned_df

def visualize_outliers(df, output_data_path):
    numerical_features = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene', 'AQI']
    plt.figure(figsize=(12, 6))
    sns.boxplot(data=df[numerical_features])
    plt.xlabel('Features')
    plt.ylabel('Values')
    plt.xticks(rotation=45)

    # Save the figure as an image file
    plt.savefig(Path(output_data_path) / "outlier_visualization.png")
    plt.close()  # Close the figure to free up memory

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", dest='input_data', type=str)
    parser.add_argument("--output_data", dest='output_data', type=str)
    args = parser.parse_args()
    return args

if __name__ == "__main__":
    args = parse_args()
    main(args)


Overwriting src/handle_outliers.py
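
The z-score rule in `handle_outliers.py` flags values more than three standard deviations from the mean and caps them at the largest value at or below the third quartile. The sketch below replays that logic on toy data with only the standard library. The function names and readings are hypothetical; `statistics.pstdev` is used because it is the population standard deviation, like `np.std` by default, and the `inclusive` quantile method should agree with `np.percentile`'s default interpolation.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


def cap_at_q3(values, outliers):
    """Replace flagged values with the largest value at or below the third quartile."""
    q3 = statistics.quantiles(values, n=4, method="inclusive")[2]
    cap = max(v for v in values if v <= q3)
    flagged = set(outliers)
    return [cap if v in flagged else v for v in values]


readings = [10] * 19 + [100]  # one extreme reading among 20
outliers = zscore_outliers(readings)
print("Outliers:", outliers)
print("Capped:", cap_at_q3(readings, outliers))
```

With very few samples no value can reach a z-score of 3, which is why the toy series uses 20 readings.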


In [12]:
%%writefile $script_folder/train_test_split_component.py
import argparse
import pandas as pd
import glob
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from pathlib import Path


def main(args):
    df = get_data(args.input_data)

    X_train, X_test, y_train, y_test = split_and_transform_data(df)
    
    # Save the data
    X_train.to_csv(Path(args.X_train_data), index=False)
    X_test.to_csv(Path(args.X_test_data), index=False)
    y_train.to_csv(Path(args.y_train_data), index=False)
    y_test.to_csv(Path(args.y_test_data), index=False)


def get_data(data_path):
    all_files = glob.glob(data_path + "/*.csv")
    df = pd.concat((pd.read_csv(f) for f in all_files), sort=False)
    return df



def split_and_transform_data(df):
    # Defining features and target
    features = ['PM2.5', 'PM10', 'NO', 'NO2', 'NOx', 'NH3', 'CO', 'SO2', 'O3', 'Benzene', 'Toluene', 'Xylene']
    target = 'AQI_Bucket'

    # Splitting features and target
    X = df[features]
    y = df[target]

    # Split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Applying standard scaling to the features
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Applying LabelEncoder to the target variable
    label_encoder = LabelEncoder()
    y_train = label_encoder.fit_transform(y_train)
    y_test = label_encoder.transform(y_test)

    # Convert back to DataFrame for saving
    X_train = pd.DataFrame(X_train, columns=features)
    X_test = pd.DataFrame(X_test, columns=features)
    y_train = pd.DataFrame(y_train, columns=[target])
    y_test = pd.DataFrame(y_test, columns=[target])

    return X_train, X_test, y_train, y_test

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data", type=str, required=True, help="Path to the folder containing the input CSV files")
    parser.add_argument("--X_train_data", type=str, required=True, help="Path to save the training features")
    parser.add_argument("--X_test_data", type=str, required=True, help="Path to save the test features")
    parser.add_argument("--y_train_data", type=str, required=True, help="Path to save the training labels")
    parser.add_argument("--y_test_data", type=str, required=True, help="Path to save the test labels")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(args)


Overwriting src/train_test_split_component.py


In [13]:
%%writefile $script_folder/apply_smote.py
import argparse
import pandas as pd
from imblearn.over_sampling import SMOTE
from pathlib import Path


def main(args):
    # Load the training data
    X_train = pd.read_csv(args.X_train_data)
    y_train = pd.read_csv(args.y_train_data)

    # Perform SMOTE to handle imbalanced data
    X_resampled, y_resampled = apply_smote(X_train, y_train)

    # Save the resampled data
    X_resampled.to_csv(Path(args.X_resampled_data), index=False)
    y_resampled.to_csv(Path(args.y_resampled_data), index=False)

def apply_smote(X, y):
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y.values.ravel())  # Ensure y is the correct shape

    # Convert back to DataFrame for saving
    X_resampled = pd.DataFrame(X_resampled, columns=X.columns)
    y_resampled = pd.DataFrame(y_resampled, columns=y.columns)

    # Print the lengths of the resampled data
    print(f"Length of X_resampled: {len(X_resampled)}")
    print(f"Length of y_resampled: {len(y_resampled)}")

    return X_resampled, y_resampled


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--X_train_data", type=str, required=True, help="Path to training data")
    parser.add_argument("--y_train_data", type=str, required=True, help="Path to training labels")
    parser.add_argument("--X_resampled_data", type=str, required=True, help="Path to save resampled training data")
    parser.add_argument("--y_resampled_data", type=str, required=True, help="Path to save resampled training labels")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    main(args)


Overwriting src/apply_smote.py


In [14]:
%%writefile $script_folder/modeling_component.py
import argparse
import pandas as pd
import xgboost as xgb
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
import joblib  # Import joblib to save the model
from azureml.core.run import Run  # Import Azure ML Run

def main(args):
    run = Run.get_context()  # Get the Azure ML run context

    # Load the data
    X_train = pd.read_csv(args.X_train_resampled)
    y_train = pd.read_csv(args.y_train_resampled).values.ravel()  # Flattening the array
    X_test = pd.read_csv(args.X_test)
    X_train1 = pd.read_csv(args.X_train)

    # Initialize the XGBoost classifier
    model = xgb.XGBClassifier(use_label_encoder=False, eval_metric="mlogloss", random_state=42)

    # Define the parameter grid for Grid Search
    param_grid = {
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [50, 100, 150],
        'subsample': [0.5, 0.7, 1.0]
    }

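    # Perform Grid Search with cross-validation.
    # AQI_Bucket is a multiclass target, so use the weighted F1 scorer; plain 'f1' assumes a binary target.
    grid_search = GridSearchCV(estimator=model, param_grid=param_grid, scoring='f1_weighted', cv=3, verbose=1)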

    # Train the model
    grid_search.fit(X_train, y_train)

    # Get the best parameters and score
    best_params = grid_search.best_params_
    best_f1_score = grid_search.best_score_
    print("Best Parameters:", best_params)
    print("Best F1 Score:", best_f1_score)

    # Log parameters and metrics to Azure ML
    run.log("Best F1 Score", best_f1_score)
    for param, value in best_params.items():
        run.log(param, value)

    # Save the best model to a file
    best_model = grid_search.best_estimator_
    joblib.dump(best_model, "best_model.pkl")

    # Upload the model to the run's outputs
    run.upload_file(name='outputs/best_model.pkl', path_or_stream='best_model.pkl')

    # Make predictions
    y_train_pred1 = grid_search.predict(X_train)
    y_train_pred = grid_search.predict(X_train1)
    y_test_pred = grid_search.predict(X_test)

    # Calculate F1 score for training predictions
    train_f1 = f1_score(y_train, y_train_pred1, average='weighted')
    print("Train F1 Score:", train_f1)

    # Log additional metrics to Azure ML
    run.log("Train F1 Score", train_f1)

    # Save predictions
    pd.DataFrame(y_train_pred, columns=["predictions"]).to_csv(args.y_train_pred, index=False)
    pd.DataFrame(y_test_pred, columns=["predictions"]).to_csv(args.y_test_pred, index=False)

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--X_train_resampled", type=str, required=True, help="Path to the resampled X_train data")
    parser.add_argument("--y_train_resampled", type=str, required=True, help="Path to the resampled y_train data")
    parser.add_argument("--X_test", type=str, required=True, help="Path to the X_test data")
    parser.add_argument("--X_train", type=str, required=True, help="Path to the X_train data")
    parser.add_argument("--y_train_pred", type=str, required=True, help="Path to save y_train predictions")
    parser.add_argument("--y_test_pred", type=str, required=True, help="Path to save y_test predictions")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    main(args)


Overwriting src/modeling_component.py


In [16]:
%%writefile $script_folder/evaluation_component.py
import argparse
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from azureml.core.run import Run

def main(args):
    run = Run.get_context()  # Get Azure ML run context

    # Load the actual and predicted values
    y_train = pd.read_csv(args.y_train_actual).values.ravel()  # Flattening if necessary
    y_train_pred = pd.read_csv(args.y_train_pred).values.ravel()  # Flattening if necessary
    y_test = pd.read_csv(args.y_test_actual).values.ravel()  # Flattening if necessary
    y_test_pred = pd.read_csv(args.y_test_pred).values.ravel()  # Flattening if necessary

    # Calculate metrics for training set
    train_accuracy = accuracy_score(y_train, y_train_pred)
    train_precision = precision_score(y_train, y_train_pred, average='weighted')
    train_recall = recall_score(y_train, y_train_pred, average='weighted')
    train_f1 = f1_score(y_train, y_train_pred, average='weighted')

    # Log training metrics
    run.log("Train Accuracy", train_accuracy)
    run.log("Train Precision", train_precision)
    run.log("Train Recall", train_recall)
    run.log("Train F1 Score", train_f1)

    # Calculate metrics for test set
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred, average='weighted')
    test_recall = recall_score(y_test, y_test_pred, average='weighted')
    test_f1 = f1_score(y_test, y_test_pred, average='weighted')

    # Log test metrics
    run.log("Test Accuracy", test_accuracy)
    run.log("Test Precision", test_precision)
    run.log("Test Recall", test_recall)
    run.log("Test F1 Score", test_f1)

    # Print the metrics
    print(f"Training Metrics: Accuracy={train_accuracy}, Precision={train_precision}, Recall={train_recall}, F1 Score={train_f1}")
    print(f"Testing Metrics: Accuracy={test_accuracy}, Precision={test_precision}, Recall={test_recall}, F1 Score={test_f1}")

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--y_train_actual", type=str, required=True, help="Path to the actual y_train data")
    parser.add_argument("--y_train_pred", type=str, required=True, help="Path to the predicted y_train data")
    parser.add_argument("--y_test_actual", type=str, required=True, help="Path to the actual y_test data")
    parser.add_argument("--y_test_pred", type=str, required=True, help="Path to the predicted y_test data")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    main(args)


Overwriting src/evaluation_component.py
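
The evaluation script reports metrics with `average='weighted'`. As a sketch of what that averaging does, the pure-Python function below computes a per-class F1 score and weights each class by its number of true samples. `weighted_f1` and the toy labels are hypothetical; in the pipeline the value comes from scikit-learn's `f1_score`.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_samples in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n_samples * f1
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(round(weighted_f1(y_true, y_pred), 4))
```

Here classes 0 and 1 each score 0.8 and class 2 scores 1.0, so the weighted result is (3 × 0.8 + 2 × 0.8 + 1 × 1.0) / 6.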


In [17]:
%%writefile load_data.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: load_data
display_name: Load Data
version: 1
type: command
inputs:
  input_data:
    type: uri_file
outputs:
  output_data:
    type: uri_folder
code: ./src
environment: azureml:custom-azureml-sklearn-env@latest
command: >
  python load_data.py --input_data ${{inputs.input_data}} --output_data ${{outputs.output_data}}


Overwriting load_data.yml


In [18]:
%%writefile clean_data.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: clean_data
display_name: Clean Missing AQI Data
version: 1
type: command
inputs:
  input_data:
    type: uri_folder
outputs:
  output_data:
    type: uri_folder
code: ./src
environment: azureml:custom-azureml-sklearn-env@latest
command: >
  python clean_data.py --input_data ${{inputs.input_data}} --output_data ${{outputs.output_data}}


Overwriting clean_data.yml


In [19]:
%%writefile impute_data.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: impute_data
display_name: Impute Missing Values
version: 1
type: command
inputs:
  input_data:
    type: uri_folder
outputs:
  output_data:
    type: uri_folder
code: ./src
environment: azureml:custom-azureml-sklearn-env@latest
command: >
  python impute_data.py --input_data ${{inputs.input_data}} --output_data ${{outputs.output_data}}


Overwriting impute_data.yml


In [20]:
%%writefile handle_outliers.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: handle_outliers
display_name: Detect and Replace Outliers
version: 1
type: command
inputs:
  input_data:
    type: uri_folder
outputs:
  output_data:
    type: uri_folder
code: ./src
environment: azureml:custom-azureml-sklearn-env@latest
command: >
    python handle_outliers.py
    --input_data ${{ inputs.input_data }}
    --output_data ${{ outputs.output_data }}


Overwriting handle_outliers.yml


In [21]:
%%writefile train_test_split_component.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_test_split
display_name: Train Test Split Component
version: 1
type: command
inputs:
  input_data:
    type: uri_folder
outputs:
  X_train_data:
    type: uri_file
  X_test_data:
    type: uri_file
  y_train_data:
    type: uri_file
  y_test_data:
    type: uri_file
code: ./src
environment: azureml:custom-azureml-sklearn-env@latest
command: >-
  python train_test_split_component.py 
  --input_data ${{inputs.input_data}} 
  --X_train_data ${{outputs.X_train_data}} 
  --X_test_data ${{outputs.X_test_data}} 
  --y_train_data ${{outputs.y_train_data}} 
  --y_test_data ${{outputs.y_test_data}}


Overwriting train_test_split_component.yml


In [22]:
%%writefile apply_smote_component.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: apply_smote
display_name: Apply SMOTE Component
version: 1
type: command
inputs:
  X_train_data:
    type: uri_file
  y_train_data:
    type: uri_file
outputs:
  X_resampled_data:
    type: uri_file
  y_resampled_data:
    type: uri_file
code: ./src
environment: azureml:custom-azureml-sklearn-env@latest
command: >-
  python apply_smote.py 
  --X_train_data ${{inputs.X_train_data}} 
  --y_train_data ${{inputs.y_train_data}} 
  --X_resampled_data ${{outputs.X_resampled_data}} 
  --y_resampled_data ${{outputs.y_resampled_data}}


Overwriting apply_smote_component.yml


In [23]:
%%writefile modeling.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: modeling_component
display_name: Modeling Component
version: 1
type: command
inputs:
  X_train_resampled:
    type: uri_file
  y_train_resampled:
    type: uri_file
  X_test:
    type: uri_file
  X_train:
    type: uri_file
outputs:
  y_train_pred:
    type: uri_file
  y_test_pred:
    type: uri_file
code: ./src
environment: azureml:custom-azureml-sklearn-env@latest
command: >
  python modeling_component.py 
  --X_train_resampled ${{inputs.X_train_resampled}} 
  --y_train_resampled ${{inputs.y_train_resampled}} 
  --X_test ${{inputs.X_test}} 
  --X_train ${{inputs.X_train}} 
  --y_train_pred ${{outputs.y_train_pred}} 
  --y_test_pred ${{outputs.y_test_pred}}


Overwriting modeling.yml


In [24]:
%%writefile evaluation_component.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: evaluate_model
display_name: Model Evaluation Component
version: 1
type: command
inputs:
  y_train_actual:
    type: uri_file
  y_train_pred:
    type: uri_file
  y_test_actual:
    type: uri_file
  y_test_pred:
    type: uri_file
# outputs:
#   evaluation_results:
#     type: uri_file
code: ./src
environment: azureml:custom-azureml-sklearn-env@latest
command: >-
  python evaluation_component.py 
  --y_train_actual ${{inputs.y_train_actual}} 
  --y_train_pred ${{inputs.y_train_pred}} 
  --y_test_actual ${{inputs.y_test_actual}} 
  --y_test_pred ${{inputs.y_test_pred}}


Overwriting evaluation_component.yml


### Components in Azure ML
In Azure Machine Learning, **components** are reusable, modular building blocks that represent specific tasks within a machine learning pipeline. These tasks can range from data loading and preprocessing to model training, evaluation, or even custom operations. Components are typically implemented as Python scripts or executable tasks, with clearly defined input and output interfaces.

Each component runs independently and is configured through its declared inputs, outputs, and parameters, so the same component can be reused across pipelines. In this notebook, each YAML file above declares one component: its interface, the `./src` code folder, the environment, and the command that runs the corresponding script.
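
Because a component's command refers to its interface through `${{inputs.<name>}}` and `${{outputs.<name>}}` placeholders, a mismatch between the two is an easy mistake to make. The sketch below is a hypothetical local check, not an Azure ML API: it takes a component spec as a plain dictionary (mirroring `load_data.yml`) and lists any placeholder that is not declared.

```python
import re

# Hypothetical in-memory copy of a component spec, mirroring load_data.yml.
component = {
    "inputs": {"input_data": {"type": "uri_file"}},
    "outputs": {"output_data": {"type": "uri_folder"}},
    "command": (
        "python load_data.py --input_data ${{inputs.input_data}} "
        "--output_data ${{outputs.output_data}}"
    ),
}

# Matches ${{inputs.name}} / ${{outputs.name}}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\$\{\{\s*(inputs|outputs)\.(\w+)\s*\}\}")


def undeclared_placeholders(spec):
    """Return (section, name) pairs used in the command but missing from the interface."""
    declared = {("inputs", name) for name in spec.get("inputs", {})}
    declared |= {("outputs", name) for name in spec.get("outputs", {})}
    used = set(PLACEHOLDER.findall(spec["command"]))
    return sorted(used - declared)


print(undeclared_placeholders(component))
```

An empty list means every placeholder in the command has a matching declaration.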


In [25]:
from azure.ai.ml import load_component

# Define the directory where your YAML files are stored
parent_dir = "./"  # Change this path as needed

# Load the components using load_component
load_data_component = load_component(source=parent_dir + "load_data.yml")
clean_data_component = load_component(source=parent_dir + "clean_data.yml")
impute_data_component = load_component(source=parent_dir + "impute_data.yml")
handle_outliers_component = load_component(source=parent_dir + "handle_outliers.yml")
train_test_split_component = load_component(source=parent_dir + "train_test_split_component.yml")
smote_component = load_component(source=parent_dir + "apply_smote_component.yml")
modeling_component = load_component(source=parent_dir + "modeling.yml")
evaluation_component = load_component(source=parent_dir + "evaluation_component.yml")

### Azure ML Pipeline
An **Azure Machine Learning pipeline** orchestrates a series of steps (components) to automate the machine learning workflow. Pipelines can include data ingestion, preprocessing, model training, hyperparameter tuning, and model evaluation. Each step in the pipeline can be run on different compute resources, allowing for parallelism and efficient resource utilization.

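In this project, the pipeline chains eight components in sequence: data loading, removal of rows with a missing AQI, imputation of missing pollutant values, outlier handling, train/test splitting, SMOTE resampling, model training, and evaluation. Each step consumes the outputs of earlier steps, so the execution order follows from how inputs and outputs are bound.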
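
The execution order of a pipeline follows from its data dependencies. The sketch below is only an illustration of that idea and not how Azure ML schedules jobs internally: it writes down which step consumes which step's outputs in this notebook and derives a valid order with the standard library's `graphlib` (Python 3.9 or later). The step names are shorthand labels chosen for the example.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes.
dependencies = {
    "load_data": set(),
    "clean_data": {"load_data"},
    "impute_data": {"clean_data"},
    "handle_outliers": {"impute_data"},
    "train_test_split": {"handle_outliers"},
    "smote": {"train_test_split"},
    "modeling": {"smote", "train_test_split"},
    "evaluation": {"modeling", "train_test_split"},
}

# A topological order runs every step after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

Because every step here depends on the one before it, only one order is possible; steps with no path between them could run in parallel.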

In [26]:
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

@pipeline()
def AQI_classification(pipeline_job_input):
    load_data = load_data_component(input_data=pipeline_job_input)
    clean_data = clean_data_component(input_data=load_data.outputs.output_data)
    impute_data = impute_data_component(input_data=clean_data.outputs.output_data)
    handle_outliers = handle_outliers_component(input_data=impute_data.outputs.output_data)
    train_test_split_data = train_test_split_component(input_data=handle_outliers.outputs.output_data)
    smote = smote_component(
        X_train_data=train_test_split_data.outputs.X_train_data,
        y_train_data=train_test_split_data.outputs.y_train_data,
    )
    modeling = modeling_component(
        X_train_resampled=smote.outputs.X_resampled_data,
        y_train_resampled=smote.outputs.y_resampled_data,
        X_test=train_test_split_data.outputs.X_test_data,
        X_train=train_test_split_data.outputs.X_train_data,
    )
    evaluation = evaluation_component(
        y_train_actual=train_test_split_data.outputs.y_train_data,
        y_train_pred=modeling.outputs.y_train_pred,
        y_test_actual=train_test_split_data.outputs.y_test_data,
        y_test_pred=modeling.outputs.y_test_pred,
    )
    # The evaluation component declares no outputs, so expose the test labels
    # and predictions from the steps that actually produce them.
    return {
        "y_test_actual": train_test_split_data.outputs.y_test_data,
        "y_test_pred": modeling.outputs.y_test_pred,
    }

pipeline_job = AQI_classification(Input(type=AssetTypes.URI_FILE, path="azureml://subscriptions/edc61183-6780-4b6c-8c4a-a5fb6ae6b788/resourcegroups/RaghavaVigneshwar/workspaces/raghava/datastores/workspaceblobstore/paths/UI/2024-09-27_053645_UTC/city_day.csv"))

In [27]:
print(pipeline_job)


display_name: AQI_classification
type: pipeline
inputs:
  pipeline_job_input:
    type: uri_file
    path: azureml://subscriptions/edc61183-6780-4b6c-8c4a-a5fb6ae6b788/resourcegroups/RaghavaVigneshwar/workspaces/raghava/datastores/workspaceblobstore/paths/UI/2024-09-27_053645_UTC/city_day.csv
jobs:
  load_data:
    type: command
    inputs:
      input_data:
        path: ${{parent.inputs.pipeline_job_input}}
    component:
      $schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
      name: load_data
      version: '1'
      display_name: Load Data
      type: command
      inputs:
        input_data:
          type: uri_file
      outputs:
        output_data:
          type: uri_folder
      command: 'python load_data.py --input_data ${{inputs.input_data}} --output_data
        ${{outputs.output_data}}

        '
      environment: azureml:custom-azureml-sklearn-env@latest
      code: /mnt/batch/tasks/shared/LS_root/mounts/clusters/raghava/code/Users/U

In [28]:

pipeline_job.settings.default_compute = "CLUSTER1"
pipeline_job.settings.default_datastore = "workspaceblobstore"
print(pipeline_job)


In [29]:
# submit job to workspace
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_AQI1"
)

aml_url = pipeline_job.studio_url
print("Monitor your pipeline_job at", aml_url)

Uploading src (0.02 MBs): 100%|██████████| 16432/16432 [00:00<00:00, 150663.03it/s]



Monitor your pipeline_job at https://ml.azure.com/runs/boring_leg_r7dvr0xmjz?wsid=/subscriptions/edc61183-6780-4b6c-8c4a-a5fb6ae6b788/resourcegroups/raghavavigneshwar/workspaces/raghava&tid=4cfe372a-37a4-44f8-91b2-5faf34253c62
