# Training Script for DVC Using the Iris Classifier


## Problem Statement

This notebook incorporates DVC into the week 1 iris pipeline. This version does not run commands that copy files directly to the GCS bucket; instead, it uses DVC to version-track the files and push them to a GCS remote.


## Setting Up the Configuration



In [1]:
# Vertex SDK for Python
! pip3 install --upgrade --quiet  google-cloud-aiplatform

In [2]:
PROJECT_ID = "gentle-presence-472611-u8"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

### Create a Cloud Storage bucket


In [3]:
BUCKET_URI = "gs://mlops-course-gentle-presence-472611-u8-v4-unique-week1"  # @param {type:"string"}

In [4]:
! gsutil mb -l {LOCATION} -p {PROJECT_ID} {BUCKET_URI}

Creating gs://mlops-course-gentle-presence-472611-u8-v4-unique-week1/...
ServiceException: 409 A Cloud Storage bucket named 'mlops-course-gentle-presence-472611-u8-v4-unique-week1' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


### Initialize Vertex AI SDK for Python


In [5]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

## Training 

In [8]:
# Standard library
import os
import time

# Third-party
import joblib
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load both versions of the data
data_v1 = pd.read_csv("data/v1_data.csv")
data_v2 = pd.read_csv("data/v2_data.csv")

In [22]:
def train(version, data):
    """Train a decision tree on one data version, then save its predictions and the model."""
    features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    # Stratified 60/40 split keeps the species proportions the same in both sets
    train_df, test_df = train_test_split(data, test_size=0.4, stratify=data['species'], random_state=42)
    X_train, y_train = train_df[features], train_df['species']
    X_test, y_test = test_df[features], test_df['species']

    mod_dt = DecisionTreeClassifier(max_depth=3, random_state=1)
    mod_dt.fit(X_train, y_train)

    # One timestamp shared by the predictions file and the model file of this run
    timestamp = time.strftime("%Y%m%d_%H%M%S")

    # Save the test-set predictions; create the folder first so to_csv does not fail
    prediction = mod_dt.predict(X_test)
    os.makedirs("outputs", exist_ok=True)
    out_path = f"outputs/outputs_{version}_{timestamp}.csv"
    pd.DataFrame({f"predicted_{version}": prediction}).to_csv(out_path, index=False)
    # accuracy_score expects (y_true, y_pred)
    print('The accuracy of the Decision Tree is', "{:.3f}".format(metrics.accuracy_score(y_test, prediction)))

    # Save the fitted model under artifacts/model_<version>/
    OUTPUT_FOLDER = f"artifacts/model_{version}"
    os.makedirs(OUTPUT_FOLDER, exist_ok=True)
    joblib.dump(mod_dt, f"{OUTPUT_FOLDER}/model_{timestamp}.joblib")
    print("Model saved to artifacts")

In [23]:
train("v1",data_v1)

The accuracy of the Decision Tree is 0.951
Model saved to artifacts


In [None]:
train("v2",data_v2)
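
Because each run writes a new timestamped file under `artifacts/model_<version>/`, it can help to resolve the most recent one before loading it. The following is a minimal sketch, not part of the original notebook; `latest_model_path` is a hypothetical helper name, and it assumes the `model_<timestamp>.joblib` naming used by `train` above.

```python
import glob
import os


def latest_model_path(version, artifacts_root="artifacts"):
    """Return the newest model_<timestamp>.joblib for a version, or None if none exist.

    Hypothetical helper: assumes files are named as in train() above.
    """
    pattern = os.path.join(artifacts_root, f"model_{version}", "model_*.joblib")
    # Timestamps use %Y%m%d_%H%M%S, so lexicographic order matches chronological order
    candidates = sorted(glob.glob(pattern))
    return candidates[-1] if candidates else None


print(latest_model_path("v1"))
```

If a path is returned, the model can be restored with `joblib.load(latest_model_path("v1"))`.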

### Track the model using DVC
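
One possible way to carry out this step is sketched below. It is an illustration under stated assumptions, not the notebook's confirmed method: it assumes DVC is installed with GCS support (for example `pip install "dvc[gs]"`), that the working directory is already a Git repository, and that the DVC cache lives under a `dvc-store` prefix in the bucket. The remote name `gcs_remote`, the `dvc-store` prefix, and both helper functions are hypothetical. The sketch defaults to a dry run that only prints the commands.

```python
import os
import subprocess

# Same bucket as BUCKET_URI defined earlier in the notebook
BUCKET_URI = "gs://mlops-course-gentle-presence-472611-u8-v4-unique-week1"


def dvc_track_model_commands(version, bucket_uri=BUCKET_URI,
                             remote_name="gcs_remote", initialized=None):
    """Build the DVC commands that track artifacts/model_<version> and push it.

    remote_name and the dvc-store prefix are assumptions for illustration.
    """
    if initialized is None:
        # A .dvc folder means `dvc init` has already been run here
        initialized = os.path.isdir(".dvc")
    commands = []
    if not initialized:
        commands.append(["dvc", "init"])
    commands += [
        # --default makes this the remote used by `dvc push`;
        # --force overwrites a remote of the same name if one exists
        ["dvc", "remote", "add", "--default", "--force",
         remote_name, f"{bucket_uri}/dvc-store"],
        ["dvc", "add", f"artifacts/model_{version}"],
        ["dvc", "push"],
    ]
    return commands


def run_commands(commands, dry_run=True):
    """Print each command and execute it only when dry_run is False."""
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_commands(dvc_track_model_commands("v1"))
```

Calling `run_commands(..., dry_run=False)` would execute the commands for real; `dvc add` then writes `artifacts/model_v1.dvc`, which is the small pointer file meant to be committed to Git.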


### Track the output files using DVC
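
As a rough sketch of this step, the prediction CSVs written to `outputs/` could be tracked as a single directory, with the resulting pointer files committed to Git. This assumes DVC and Git are already initialised in the working directory and that a default DVC remote has been configured; the helper names and the commit message are hypothetical. Commands are only printed unless `dry_run=False` is passed.

```python
import os
import subprocess


def dvc_pointer_files(target):
    """Return the files `dvc add <target>` creates or updates for Git to track.

    DVC places <target>.dvc next to the target and lists the target in a
    .gitignore in the same folder.
    """
    target = target.rstrip("/")
    return target + ".dvc", os.path.join(os.path.dirname(target), ".gitignore")


def track_outputs_commands(target="outputs",
                           message="Track prediction outputs with DVC"):
    """Build the commands to version the outputs folder (message is illustrative)."""
    dvc_file, gitignore = dvc_pointer_files(target)
    return [
        ["dvc", "add", target],
        ["git", "add", dvc_file, gitignore],
        ["git", "commit", "-m", message],
        ["dvc", "push"],
    ]


def run_commands(commands, dry_run=True):
    """Print each command and execute it only when dry_run is False."""
    for cmd in commands:
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run_commands(track_outputs_commands())
```

Tracking the whole folder rather than individual CSVs means each new timestamped predictions file is picked up by re-running `dvc add outputs`.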