# Week 1 - Graded Assignment 1
Set up the ML pipeline for the Iris classifier on the Vertex AI platform in your own GCP account, using Google Cloud Storage (GCS), as demonstrated in the lecture (Hands-on: Introduction to Google Cloud, Vertex AI).

## Assignment Objective

1. Store the training data in a Google Cloud Storage bucket

2. Fetch the data from the bucket and successfully execute the Iris machine learning training pipeline

3. Store the output artifacts (models, logs, etc.) in a Google Cloud Storage bucket, in folders organized by training execution timestamp

4. Create a new script for inference and run inference on the eval set after fetching the model from the GCS output artifacts bucket

5. Run training and inference twice, resulting in two output artifact folders in the Google Cloud Storage bucket

6. (Optional) Run this pipeline for the two versions of the data provided in the GitHub data folder

## Initial steps

### Install Vertex AI SDK for Python and other required packages

In [1]:
# Vertex SDK for Python
! pip3 install --upgrade --quiet google-cloud-aiplatform

In [9]:
# One command covers both packages (the separate google-cloud-storage install was redundant)
!pip install --upgrade google-cloud-storage google-cloud-aiplatform


### Set Google Cloud project information

In [2]:
PROJECT_ID = "lively-nimbus-473407-m9"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [4]:
BUCKET_URI = "gs://mlops-lively-nimbus-473407-m9"  # @param {type:"string"}

**If your bucket doesn't already exist:** run the following cell to create your Cloud Storage bucket.

In [5]:
! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}

Creating gs://mlops-lively-nimbus-473407-m9/...


### Initialize Vertex AI SDK for Python

In [10]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

## 1. Store Training Data in Google Storage Bucket

### Fetch data from git repository
Data cannot be copied straight from the git repository into the Cloud Storage bucket, so we first clone the repository locally as an intermediate step.

In [11]:
! git clone --branch week_1 https://github.com/IITMBSMLOps/ga_resources.git

Cloning into 'ga_resources'...
remote: Enumerating objects: 37, done.
remote: Counting objects: 100% (37/37), done.
remote: Compressing objects: 100% (32/32), done.
remote: Total 37 (delta 8), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (37/37), 27.40 KiB | 6.85 MiB/s, done.
Resolving deltas: 100% (8/8), done.


### Save data to bucket

In [12]:
! gsutil cp -r ga_resources/data/ {BUCKET_URI}/

Copying file://ga_resources/data/v1/data.csv [Content-Type=text/csv]...
Copying file://ga_resources/data/raw/iris.csv [Content-Type=text/csv]...        
Copying file://ga_resources/data/v2/data.csv [Content-Type=text/csv]...         
/ [3 files][  7.6 KiB/  7.6 KiB]                                                
Operation completed over 3 objects/7.6 KiB.                                      
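
The same recursive copy can also be done from Python instead of `gsutil`. The following is a minimal sketch, not the method used in the lecture: `plan_uploads` and `upload_dir` are hypothetical helper names, and the upload itself assumes the `google-cloud-storage` package is installed and credentials are configured.

```python
from pathlib import Path


def plan_uploads(local_dir, prefix):
    """Map each file under local_dir to the object name it would get in the bucket."""
    root = Path(local_dir)
    plan = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Object names always use forward slashes, whatever the local OS.
            plan[str(path)] = f"{prefix}/{path.relative_to(root).as_posix()}"
    return plan


def upload_dir(bucket_name, local_dir, prefix):
    """Upload every planned file (requires google-cloud-storage and credentials)."""
    # Imported lazily so that plan_uploads can be used and checked offline.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    for local_path, blob_name in plan_uploads(local_dir, prefix).items():
        bucket.blob(blob_name).upload_from_filename(local_path)
```

Calling `upload_dir("mlops-lively-nimbus-473407-m9", "ga_resources/data", "data")` should produce the same `data/...` layout as the `gsutil cp -r` cell. Note that the client takes the bare bucket name, without the `gs://` prefix.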


## 2. IRIS Machine Learning Training Pipeline

### Import the required libraries

In [13]:
import os
import sys
import pandas as pd
import numpy as np
from pandas.plotting import parallel_coordinates
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics
from zoneinfo import ZoneInfo
from datetime import datetime

### Fetch data from the bucket

In [14]:
! gsutil cp -r {BUCKET_URI}/data/ .

Copying gs://mlops-lively-nimbus-473407-m9/data/raw/iris.csv...
Copying gs://mlops-lively-nimbus-473407-m9/data/v1/data.csv...                  
Copying gs://mlops-lively-nimbus-473407-m9/data/v2/data.csv...                  
/ [3 files][  7.6 KiB/  7.6 KiB]                                                
Operation completed over 3 objects/7.6 KiB.                                      


### Import Dataset

Remember to **update the path to the data CSV** so that it points at the data version you want to train on.

In [47]:
# data = pd.read_csv('data/raw/iris.csv')
# data = pd.read_csv('data/v1/data.csv')
data = pd.read_csv('data/v2/data.csv')
data.head(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


### Train Test Split

In [48]:
train, test = train_test_split(data, test_size = 0.4, stratify = data['species'], random_state = 42)
X_train = train[['sepal_length','sepal_width','petal_length','petal_width']]
y_train = train.species
X_test = test[['sepal_length','sepal_width','petal_length','petal_width']]
y_test = test.species

### Eval Set

In [49]:
X_train, X_eval, y_train, y_eval = train_test_split(X_train, y_train, test_size = 0.2, stratify = y_train, random_state = 42)
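
The two splits above are chained, so the eval fraction applies to what is left after the test split, not to the full dataset. The sketch below works through the arithmetic. It assumes scikit-learn rounds a fractional held-out size up and gives the remainder to the other side, and it uses 150 rows (the size of the classic Iris dataset) purely as an illustration, since the row counts of `v1` and `v2` are not shown here. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_rows, test_size=0.4, eval_size=0.2):
    """Row counts produced by a test split followed by an eval split."""
    # First split: hold out the test set from the full data.
    n_test = math.ceil(test_size * n_rows)
    n_remaining = n_rows - n_test
    # Second split: hold out the eval set from the remaining training rows.
    n_eval = math.ceil(eval_size * n_remaining)
    n_train = n_remaining - n_eval
    return {"train": n_train, "eval": n_eval, "test": n_test}


# 150 rows is an assumption, not a property of data/v2/data.csv.
print(split_sizes(150))
```

Under that assumption the model would be fitted on 72 rows, with 18 rows for eval and 60 for test, i.e. 48%, 12% and 40% of the data.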

### Simple Decision Tree model

Build a Decision Tree model on iris data

In [50]:
mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
mod_dt.fit(X_train,y_train)

DecisionTreeClassifier(max_depth=3, random_state=1)


## 3. Store the Output artifacts

Store the output artifacts (models, logs, etc.) in the Google Cloud Storage bucket, in folders named after the training execution timestamp.



### Path to your model artifacts

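`MODEL_ARTIFACT_DIR` - Folder path for this run's model artifacts within the Cloud Storage bucket. The folder name is the run's timestamp in the `Asia/Kolkata` time zone, formatted as `YYYYMMDD_HHMMSS`.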



In [51]:
MODEL_ARTIFACT_DIR = f"iris_artifacts/{datetime.now(tz = ZoneInfo('Asia/Kolkata')).strftime('%Y%m%d_%H%M%S')}"
MODEL_ARTIFACT_DIR

'iris_artifacts/20251106_112617'
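
Wrapping the expression in a small function makes the naming scheme easier to reuse across runs and to check with a fixed clock. This is a sketch: `make_artifact_dir` and its `now` and `root` parameters are hypothetical names introduced for illustration.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def make_artifact_dir(now=None, root="iris_artifacts"):
    """Return '<root>/<YYYYMMDD_HHMMSS>' for the given (or current) time."""
    if now is None:
        # Same clock as the notebook cell: current time in Asia/Kolkata.
        now = datetime.now(tz=ZoneInfo("Asia/Kolkata"))
    return f"{root}/{now.strftime('%Y%m%d_%H%M%S')}"


# A fixed clock reading reproduces the folder name shown above.
ist = timezone(timedelta(hours=5, minutes=30))
print(make_artifact_dir(datetime(2025, 11, 6, 11, 26, 17, tzinfo=ist)))
```

Because the fields run from year down to second with zero padding, sorting the folder names alphabetically also sorts them chronologically.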

### Store the artifacts locally

In [52]:
import joblib

! mkdir -p artifacts

# Serialize the trained model to a local file
joblib.dump(mod_dt, "artifacts/model.joblib")

['artifacts/model.joblib']

### Store the artifacts in Google Cloud Storage Bucket

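Copy the serialized model into the timestamped folder in the bucket so that later steps (inference in this notebook, or serving on Vertex AI) can fetch it from Cloud Storage. This pipeline produces a single artifact:

- `model.joblib` (model artifact)

No preprocessor is fitted in this notebook, so there is no `preprocessor.pkl` to upload.

Run the following command to upload the file: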



In [53]:
# Store output artifacts to google cloud storage bucket
! gsutil cp artifacts/model.joblib {BUCKET_URI}/{MODEL_ARTIFACT_DIR}/

Copying file://artifacts/model.joblib [Content-Type=application/octet-stream]...
/ [1 files][  2.2 KiB/  2.2 KiB]                                                
Operation completed over 1 objects/2.2 KiB.                                      


## 4. Create a new script for inference

Run inference on the eval set after fetching the model from the GCS output artifacts bucket.



### Copy model from Bucket


In [54]:
! gsutil cp {BUCKET_URI}/{MODEL_ARTIFACT_DIR}/model.joblib .

Copying gs://mlops-lively-nimbus-473407-m9/iris_artifacts/20251106_112617/model.joblib...
/ [1 files][  2.2 KiB/  2.2 KiB]                                                
Operation completed over 1 objects/2.2 KiB.                                      


### Load model


In [55]:
model = joblib.load("./model.joblib")

### Evaluating the Performance

#### Eval set


In [56]:
prediction = model.predict(X_eval)
# accuracy_score expects (y_true, y_pred)
print('The accuracy of the Decision Tree is', "{:.3f}".format(metrics.accuracy_score(y_eval, prediction)))

The accuracy of the Decision Tree is 1.000


#### Test set


In [57]:
prediction = model.predict(X_test)
# accuracy_score expects (y_true, y_pred)
print('The accuracy of the Decision Tree is', "{:.3f}".format(metrics.accuracy_score(y_test, prediction)))

The accuracy of the Decision Tree is 1.000
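
The cells above run inference inside the notebook. As a standalone script, the same steps could look like the sketch below. The file name `inference.py`, the function names and the `--model-uri` and `--data` flags are hypothetical. It rebuilds the eval set by repeating the notebook's two splits with the same `random_state`, and it assumes `google-cloud-storage`, `joblib`, `pandas` and `scikit-learn` are installed.

```python
"""inference.py: hypothetical standalone inference script (a sketch)."""
import argparse

FEATURES = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def split_gcs_uri(uri):
    """Split 'gs://bucket/path/to/blob' into ('bucket', 'path/to/blob')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri}")
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Evaluate a stored Iris model.")
    parser.add_argument("--model-uri", required=True, help="gs:// URI of model.joblib")
    parser.add_argument("--data", required=True, help="local path to the data CSV")
    return parser.parse_args(argv)


def rebuild_eval_set(csv_path):
    """Repeat the notebook's splits (same seeds) to recover the eval set."""
    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv(csv_path)
    train, _ = train_test_split(
        data, test_size=0.4, stratify=data["species"], random_state=42
    )
    _, X_eval, _, y_eval = train_test_split(
        train[FEATURES], train["species"],
        test_size=0.2, stratify=train["species"], random_state=42,
    )
    return X_eval, y_eval


def main(argv=None):
    import joblib
    from google.cloud import storage
    from sklearn import metrics

    args = parse_args(argv)
    bucket_name, blob_name = split_gcs_uri(args.model_uri)
    # Fetch the model from the output artifacts bucket, then load it.
    storage.Client().bucket(bucket_name).blob(blob_name).download_to_filename("model.joblib")
    model = joblib.load("model.joblib")
    X_eval, y_eval = rebuild_eval_set(args.data)
    accuracy = metrics.accuracy_score(y_eval, model.predict(X_eval))
    print(f"Eval accuracy: {accuracy:.3f}")
```

When saving this as a file, call `main()` under an `if __name__ == "__main__":` guard, then run it with, for example, `python inference.py --model-uri gs://mlops-lively-nimbus-473407-m9/iris_artifacts/20251106_112617/model.joblib --data data/v2/data.csv`.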


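## 5. Run training and inference twice, resulting in two output artifact folders in the Google Cloud Storage bucket

Run steps 3 and 4 again (re-run the training cells in step 2 first if the model should be retrained). Each run computes a new timestamp for `MODEL_ARTIFACT_DIR`, so its artifacts land in a separate folder.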
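
To confirm that two runs really produced two folders, the object names in the bucket can be grouped by their timestamp component. This is a sketch with hypothetical helper names (`run_folders`, `list_runs`). The listing call assumes `google-cloud-storage` is installed and credentials are configured.

```python
def run_folders(blob_names, root="iris_artifacts"):
    """Return the distinct timestamp folders found under root, oldest first."""
    folders = set()
    for name in blob_names:
        parts = name.split("/")
        # Expect '<root>/<timestamp>/<file>'; ignore anything else.
        if len(parts) >= 3 and parts[0] == root:
            folders.add(parts[1])
    return sorted(folders)


def list_runs(bucket_name, root="iris_artifacts"):
    """List run folders in a real bucket (requires google-cloud-storage)."""
    # Imported lazily so that run_folders can be used and checked offline.
    from google.cloud import storage

    blobs = storage.Client().list_blobs(bucket_name, prefix=f"{root}/")
    return run_folders(blob.name for blob in blobs)


# The second object name is made up to stand in for a second run.
names = [
    "iris_artifacts/20251106_112617/model.joblib",
    "iris_artifacts/20251106_113000/model.joblib",
    "data/v1/data.csv",
]
print(run_folders(names))
```

After two runs, `len(list_runs("mlops-lively-nimbus-473407-m9"))` should be at least 2.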



## 6. (Optional) Run this pipeline for the two versions of the data provided in the GitHub data folder

In step 2, point `pd.read_csv` at the other data version (`data/v1/data.csv` or `data/v2/data.csv`), then re-run the remaining steps.

