# DVC & MLFlow Classification Workflow

### Set up the environment

You can set up the environment in either of the following ways:

#### Through the exported YAML file

If you're using Anaconda, import the provided `conda_env.yml` file by running the following command:

```
$  conda env create -f conda_env.yml
```


#### Through the Python requirements file

If you don't have Anaconda installed, you can install all the required Python dependencies by running:

```
$  pip install -r requirements.txt
```


### Running workflow stages through the Jupyter Notebook

You can use the provided `ML Project Management with Git, DVC, and MLFlow.ipynb` to run individual pipeline stages. The global variables can be set in the `src/config.py` or by passing them as arguments through individual function calls.

**Note:** For training stages to work properly through the notebook, the `final_model` flag in the `config.py` file needs to be set to `False`. The flag indicates to the DVC pipeline that model exploration has been completed and that the pipeline run should result in a model that needs to be registered and published.


### Running workflow stages through the shell

You can also run individual stages from the command line by executing the stage's `py` file. For example, you can train a model by running:

```
$  python src/train.py
```

In this case, all config parameters, including the model type and hyperparameters, are read from `config.py`. A future release of this workflow will support passing these parameters as shell arguments.
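Until shell arguments are supported, one possible shape for them is sketched below with `argparse`. This is an illustration, not the planned implementation: the flag names `--model-type` and `--hyperparameter` and the two default constants are hypothetical, and only the model keys `dt`, `rf`, and `svm` come from the notebook further down.

```python
import argparse

# Hypothetical defaults standing in for the values kept in src/config.py.
DEFAULT_MODEL_TYPE = "dt"
DEFAULT_HYPERPARAMETER = 3


def parse_args(argv=None):
    """Parse command-line overrides for the training stage (sketch)."""
    parser = argparse.ArgumentParser(description="Train a model (sketch).")
    parser.add_argument(
        "--model-type",
        choices=["dt", "rf", "svm"],  # model keys used in the notebook
        default=DEFAULT_MODEL_TYPE,
    )
    parser.add_argument(
        "--hyperparameter",
        type=float,  # cast to int before use for max_depth or n_estimators
        default=DEFAULT_HYPERPARAMETER,
        help="max_depth for dt, n_estimators for rf, C for svm",
    )
    return parser.parse_args(argv)


# Example: the arguments a Random Forest run with 100 trees might receive.
args = parse_args(["--model-type", "rf", "--hyperparameter", "100"])
print(args.model_type, args.hyperparameter)
```

Falling back to the `config.py` values as defaults would keep `python src/train.py` working unchanged when no arguments are given.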


### Running the pipeline via DVC

Once you're happy with the model, you can run the entire pipeline with the following shell command:

```
$  dvc repro
```

### Visualizing the Pipeline DAG

You can visualize the configured end-to-end pipeline by running:

```
$  dvc dag
```

### View MLFlow Tracking UI

The main benefit of this workflow is the ability to track model artifacts and analyze them through an easy-to-use interface. These artifacts are recorded and tracked by MLFlow and can be analyzed in its web UI. To run the UI server:

```
$  cd src
$  mlflow ui
```
**Note:** You must be in the `src` directory before you launch the UI because all the MLFlow data is stored in `src/mlruns`. Future releases of this workflow will have the functionality to store MLFlow artifacts to a custom location.
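As a possible alternative to changing directories, the run store can be passed to the UI explicitly. The sketch below only builds and launches the command; it assumes that the installed MLFlow version accepts the `--backend-store-uri` and `--port` options of `mlflow ui`, and the helper names are hypothetical.

```python
import subprocess


def mlflow_ui_command(store_dir="src/mlruns", port=5000):
    """Build the argv for launching the MLFlow UI against a given run store."""
    return [
        "mlflow", "ui",
        "--backend-store-uri", str(store_dir),  # where the runs are stored
        "--port", str(port),                    # 5000 matches the notebook URL
    ]


def launch_mlflow_ui(store_dir="src/mlruns", port=5000):
    """Start the UI server from the project root; blocks until it is stopped."""
    return subprocess.run(mlflow_ui_command(store_dir, port), check=True)


print(" ".join(mlflow_ui_command()))
```

Calling `launch_mlflow_ui()` from the project root would then serve the runs stored in `src/mlruns` without a prior `cd src`.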

In [1]:
import os
os.chdir('../src')

In [2]:
# Import the Workflow API
import preprocessing, split, train, evaluate

In [3]:
# Set global variables
path = '../data'

In [4]:
# Run the feature engineering stage
# Warning: This stage takes ~2 hours to run.
processed_features = preprocessing.preprocessing(path)

In [5]:
# Processed features are stored in data/features. If the original data has not changed,
# the previously computed features can be loaded directly.
import pandas as pd

processed_features = pd.read_csv('../data/features/features.csv')

In [6]:
# View/Analyze the processed features
processed_features.head()

| | power | pitch_mean | pitch_sd | voiced_fr | tempogram | label |
|---|---|---|---|---|---|---|
| 0 | 0.011559 | 181.337028 | 49.182666 | 0.875486 | 0.160427 | Harry |
| 1 | 0.030912 | 154.289764 | 23.416716 | 0.888167 | 0.160363 | Harry |
| 2 | 0.028044 | 215.548706 | 37.241349 | 0.684276 | 0.194186 | Harry |
| 3 | 0.050587 | 336.075273 | 61.21465 | 0.911296 | 0.247447 | Harry |
| 4 | 0.028518 | 416.962326 | 26.971695 | 0.539376 | 0.171848 | Harry |


In [7]:
# Analyze the features dataframe
processed_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1669 entries, 0 to 1668
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   power       1669 non-null   float64
 1   pitch_mean  1669 non-null   float64
 2   pitch_sd    1669 non-null   float64
 3   voiced_fr   1669 non-null   float64
 4   tempogram   1669 non-null   float64
 5   label       1669 non-null   object 
dtypes: float64(5), object(1)
memory usage: 78.4+ KB


In [8]:
# Summary statistics for the numeric features
processed_features.describe()

| | power | pitch_mean | pitch_sd | voiced_fr | tempogram |
|---|---|---|---|---|---|
| count | 1669.0 | 1669.0 | 1669.0 | 1669.0 | 1669.0 |
| mean | 0.039 | 261.2614 | 40.711915 | 0.7364 | 0.181999 |
| std | 0.046664 | 101.69724 | 15.927097 | 0.134755 | 0.054971 |
| min | 0.00032 | 0.0 | 0.0 | 0.0 | 0.053416 |
| 25% | 0.019316 | 172.777705 | 29.487238 | 0.670008 | 0.141208 |
| 50% | 0.030007 | 239.306226 | 37.649244 | 0.761447 | 0.184042 |
| 75% | 0.044822 | 378.390913 | 49.995933 | 0.827934 | 0.222201 |
| max | 0.676868 | 432.385272 | 134.570905 | 0.990399 | 0.308246 |


In [9]:
# Split the data
train_df, test_df = split.simple_split(processed_features, 0.7)


15% of the data has been stored as the Blind Holdout. It is available in /src/feature_store.


In [10]:
# View the splits
print('Training Dataframe:', train_df.shape)
print('Testing Dataframe:', test_df.shape)

Training Dataframe: (993, 6)
Testing Dataframe: (426, 6)
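The shapes above are consistent with a two-step split: 15% of the 1669 rows are set aside as the blind holdout, and 70% of the remaining rows go to training. The sketch below reproduces that arithmetic on row indices only. It is not the code of `split.simple_split`; the function name, the fixed seed, and rounding to the nearest integer are assumptions made for illustration.

```python
import random


def split_indices(n_rows, holdout_frac=0.15, train_frac=0.7, seed=0):
    """Return (holdout, train, test) index lists for a two-step split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # shuffle reproducibly

    # Step 1: set aside the blind holdout from the full dataset.
    n_holdout = round(n_rows * holdout_frac)
    holdout, remaining = indices[:n_holdout], indices[n_holdout:]

    # Step 2: split what is left into training and testing rows.
    n_train = round(len(remaining) * train_frac)
    return holdout, remaining[:n_train], remaining[n_train:]


holdout, train, test = split_indices(1669)
print(len(holdout), len(train), len(test))  # 250 993 426
```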


In [11]:
# Train a model
# Results of this training will be logged to MLFlow. They can be analyzed by navigating 
# to http://localhost:5000 in your browser. Make sure the MLFlow UI server is running as described in the instructions above.

# For a Decision Tree, the tuned hyperparameter is max_depth

model, run_id = train.train_model('dt', train_df, test_df, 3)

Testing accuracy: 0.6267605633802817


In [12]:
# Do hyperparameter tuning
# Results of this training will be logged to MLFlow. They can be analyzed by navigating 
# to http://localhost:5000 in your browser. Make sure the MLFlow UI server is running as described in the instructions above.

max_depth = [2, 5, 8, 4, 9]

for i in max_depth:
    train.train_model('dt', train_df, test_df, i)

Testing accuracy: 0.6408450704225352
Testing accuracy: 0.636150234741784
Testing accuracy: 0.6173708920187794
Testing accuracy: 0.6244131455399061
Testing accuracy: 0.6267605633802817
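To read the sweep more easily, the printed accuracies can be paired with the depths that produced them. The sketch below copies the five values from the output above and assumes they were printed in loop order; the variable names are illustrative only.

```python
# Depths tried in the sweep and the accuracies printed above, in order.
max_depth = [2, 5, 8, 4, 9]
accuracy = [
    0.6408450704225352,
    0.636150234741784,
    0.6173708920187794,
    0.6244131455399061,
    0.6267605633802817,
]

# Pair each depth with its accuracy and sort from best to worst.
ranked = sorted(zip(max_depth, accuracy), key=lambda pair: pair[1], reverse=True)
best_depth, best_accuracy = ranked[0]
print(best_depth, round(best_accuracy, 4))  # 2 0.6408
```

In this sweep the shallowest tree scores highest on the test split, although the spread across depths is only about two percentage points.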


In [13]:
# Train a different model
# Results of this training will be logged to MLFlow. They can be analyzed by navigating 
# to http://localhost:5000 in your browser. Make sure the MLFlow UI server is running as described in the instructions above.

# For a Random Forest, the tuned hyperparameter is n_estimators

model, run_id = train.train_model('rf', train_df, test_df, 100)

Testing accuracy: 0.6995305164319249


In [14]:
# Train a different model
# Results of this training will be logged to MLFlow. They can be analyzed by navigating 
# to http://localhost:5000 in your browser. Make sure the MLFlow UI server is running as described in the instructions above.

# For an SVM, the tuned hyperparameter is C (the regularization parameter; a larger C means weaker regularization)

model, run_id = train.train_model('svm', train_df, test_df, 2)

Testing accuracy: 0.5070422535211268


In [15]:
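# Evaluate your model on the final blind holdout (validation) dataset.
# Note: if the cells are run in order, run_id refers to the most recent train_model
# assignment (the SVM run above); reassign it first to evaluate a different run.

blindholdout_data = pd.read_csv('../src/feature_store/blind_holdout.csv')
blindholdout_data.head()

evaluate.evaluate_validation(run_id, blindholdout_data)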
