# Deploy Defect Prediction pipeline (first time)

## Create Machine Learning workspace
1. In the Azure portal, select Create a resource
2. Search for Machine Learning
3. Create a new resource
4. To view the new workspace, select Go to resource

## Create VM to run notebook
1. Sign in to Azure Machine Learning
2. Select Notebooks in user section and clone this notebook
3. Add a compute and define its name (this can take a couple of minutes)
4. Once the VM is available it will be displayed in the top toolbar


# Run notebook

In [32]:
!pip install iacminer
!pip install PyGithub
!pip install ansiblemetrics
!pip install pydriller
!pip install python-dotenv

Collecting iacminer
  Downloading iacminer-0.1.tar.gz (19 kB)
Building wheels for collected packages: iacminer
  Building wheel for iacminer (setup.py) ... done
  Created wheel for iacminer: filename=iacminer-0.1-py3-none-any.whl size=18610 sha256=20ea85ccd3a4466ac050dc5eb31b122b88ede55abde59fc19ee36c9bd884b1f5
  Stored in directory: /home/azureuser/.cache/pip/wheels/24/97/c4/7a6e7ae264dc146aef2e9b3524d603811cd45a0532767df571
Successfully built iacminer
Installing collected packages: iacminer
Successfully installed iacminer-0.1


In [98]:
# Import Python packages
import os
from datetime import datetime

import pandas as pd
import azureml.core
from azureml.core import Workspace, Datastore, Dataset
from azureml.core.compute import ComputeTarget, DataFactoryCompute
from azureml.exceptions import ComputeTargetException
from azureml.data.data_reference import DataReference
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import DataTransferStep

# IaC miner
import iacminer
from iacminer.miners.github import GithubMiner
from iacminer.miners.repository import RepositoryMiner

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.6.0


## Connect workspace 

1. Create a workspace object from the existing workspace.
2. Run the following code, open the provided link, and enter the authentication code.

A Workspace is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. Workspace.from_config() reads the file config.json and loads the authentication details into the object (ws).

In [36]:
from azureml.core import Workspace
ws = Workspace.from_config()
print(ws.name, ws.location, ws.resource_group, sep='\t')

Defect_Prediction	westeurope	Defect_Prediction_ML

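As a rough illustration of what Workspace.from_config() reads, the sketch below writes a config.json with the three fields the SDK looks up: subscription_id, resource_group, and workspace_name. The helper name write_workspace_config and the subscription placeholder are assumptions made for this example; the resource group and workspace name are taken from the output above. On an Azure Machine Learning compute instance the file is usually already present, so this would only be needed when running the notebook elsewhere.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three fields Workspace.from_config() reads.

    Hypothetical helper for illustration; not part of the Azure ML SDK.
    """
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return path


# Demo in a temporary directory so no real file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_workspace_config(
        tmp,
        subscription_id="<your-subscription-id>",  # placeholder, fill in your own
        resource_group="Defect_Prediction_ML",
        workspace_name="Defect_Prediction",
    )
    loaded = json.loads(config_path.read_text(encoding="utf-8"))

print(sorted(loaded))
```

With such a file in the working directory (or a parent directory), Workspace.from_config() can be called without arguments, as in the cell above.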

## Create experiment

An experiment represents a collection of individual model runs. Parameters include your workspace reference and a string name for the experiment.

In [37]:
experiment_name = "Defect-Prediction-test-experiment"

from azureml.core import Experiment
experiment = Experiment(workspace=ws, name = experiment_name)

## Create a datastore

Call register_azure_blob_container() to make the data available to the workspace. Then, set the workspace default datastore as the output datastore. Use the output datastore to score output in the pipeline.

In [39]:
from azureml.core.datastore import Datastore

blob_datastore = Datastore.register_azure_blob_container(ws, 
                      datastore_name="iac_datastore", 
                      container_name="iac-script-data", 
                      account_name="defect_prediction_tool", 
                      overwrite=True)

def_data_store = ws.get_default_datastore()

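One thing worth checking before registering a blob container is the storage account name. Azure storage account names must be 3 to 24 characters long and may contain only lowercase letters and digits, so a name with underscores would not match an existing account. The sketch below is a small local check of that rule; the function name is_valid_storage_account_name is an assumption for this example and not part of the SDK. Registration also typically needs a credential (the method accepts account_key or sas_token), which is not covered here.

```python
import re

# Azure storage account naming rule: 3-24 characters, lowercase letters and digits only.
_ACCOUNT_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if `name` satisfies the storage account naming rule."""
    return _ACCOUNT_NAME_PATTERN.fullmatch(name) is not None


# The name used in the registration cell and the default datastore's account name
# listed under Remarks at the end of this notebook.
for candidate in ("defect_prediction_tool", "defectpredicti3405648290"):
    print(candidate, is_valid_storage_account_name(candidate))
```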
## Load data 

Get the GitHub repositories and save labeled files (defect-prone and defect-free). 

### Mine GitHub

In [91]:
import os
from datetime import datetime
from iacminer.miners.github import GithubMiner

# The token is read from the GITHUB_ACCESS_TOKEN environment variable;
# os.getenv() expects the variable's name, not the token itself.

miner = GithubMiner(access_token = os.getenv('GITHUB_ACCESS_TOKEN'),
                    date_from = datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S'),
                    date_to = datetime.strptime('2020-01-02 00:00:00', '%Y-%m-%d %H:%M:%S'),
                    pushed_after=datetime.strptime('2020-06-07 00:00:00', '%Y-%m-%d %H:%M:%S'),
                    min_stars = 0, # (default = 0)
                    min_releases = 0, # (default = 0)
                    min_watchers = 0, # (default = 0)
                    min_issues = 0, # (default = 0)
                    primary_language = None, # e.g., 'python' (default = None)
                    include_fork = False) # (default = False)

for repository in miner.mine():
    print(repository)

Query failed to run by returning code of 401. { search(query: "is:public stars:>=0 mirror:false archived:false created:2020-01-01T00:00:00Z..2020-01-02T00:00:00Z 
pushed:>=2020-06-07T00:00:00Z", type: REPOSITORY, first: 50 ) { repositoryCount pageInfo { endCursor startCursor 
hasNextPage } edges { node { ... on Repository { id defaultBranchRef { name } owner { login } name url description 
primaryLanguage { name } stargazers { totalCount } watchers { totalCount } releases { totalCount } issues { 
totalCount } createdAt pushedAt updatedAt hasIssuesEnabled isArchived isDisabled isMirror isFork object(expression: 
"master:") { ... on Tree { entries { name type } } } } } } } 

    rateLimit {
        limit
        cost
        remaining
        resetAt
    }
}


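A 401 response means GitHub rejected the credentials sent with the query. Because os.getenv() returns None when the named variable is not set, a missing token can go unnoticed until the request fails. The sketch below fails early with a clear message instead; the helper name get_github_token is an assumption for this example, and GITHUB_ACCESS_TOKEN is simply the variable name used elsewhere in this notebook.

```python
import os


def get_github_token(var_name="GITHUB_ACCESS_TOKEN"):
    """Return the GitHub token stored in the environment variable `var_name`.

    Raises RuntimeError when the variable is missing or empty, so the problem
    surfaces before any request is sent. Hypothetical helper, not part of iacminer.
    """
    token = os.getenv(var_name)
    if not token:
        raise RuntimeError(
            f"Environment variable {var_name} is not set; "
            "export a GitHub personal access token under that name first."
        )
    return token
```

The result can then be passed as access_token, for example `GithubMiner(access_token=get_github_token(), ...)`. Keeping the token in the environment (or in a .env file loaded with python-dotenv, which is installed above) also avoids storing it in the notebook itself.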

### Mine repositories

In [95]:
miner = RepositoryMiner(access_token = os.getenv('GITHUB_ACCESS_TOKEN'),
                        path_to_repo='path/to/cloned/repository',
                        branch='development') # Optional (default 'master')

# Get only fixing commits by analyzing issues
fix_from_issues = miner.get_fixing_commits_from_closed_issues()
for sha in fix_from_issues:
    print(sha)

# Get only fixing commits by analyzing commit messages
fix_from_commits = miner.get_fixing_commits_from_commit_messages()
for sha in fix_from_commits:
    print(sha)

# Get all Ansible files touched by fixing commits
miner.set_fixing_commits() # Must call this method first
fixing_files = miner.get_fixing_files()

# Get files labeled as 'defect-prone' or 'defect-free'
labeled_files = miner.label(fixing_files)

# Execute the previous methods at once and extract metrics from labeled files on a per-release basis
for metrics in miner.mine():
    print(metrics)

NoSuchPathError: /mnt/batch/tasks/shared/LS_root/mounts/clusters/sdg-compute1-test/code/users/r.drubbel/path/to/cloned/repository

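The NoSuchPathError above shows that path_to_repo still holds the placeholder value, which is resolved relative to the notebook's working directory. A minimal sketch of a pre-flight check, assuming the repository has been cloned as a regular (non-bare) working copy, is shown below; is_git_repository is a hypothetical helper, not part of iacminer.

```python
import tempfile
from pathlib import Path


def is_git_repository(path):
    """Return True if `path` is a directory that contains a .git entry.

    This only covers regular working copies; bare repositories have no .git entry.
    """
    repo = Path(path)
    return repo.is_dir() and (repo / ".git").exists()


# Demo with throwaway directories: one plain folder and one that looks like a clone.
with tempfile.TemporaryDirectory() as tmp:
    plain_dir = Path(tmp) / "not-a-repo"
    plain_dir.mkdir()
    fake_repo = Path(tmp) / "fake-repo"
    (fake_repo / ".git").mkdir(parents=True)
    plain_result = is_git_repository(plain_dir)
    repo_result = is_git_repository(fake_repo)

print(plain_result, repo_result)
```

Running the check on the intended path before constructing RepositoryMiner gives a clearer signal than the stack trace when the clone step was skipped.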
### Combine GithubMiner and RepositoryMiner

In [90]:
gh_miner = GithubMiner(access_token = os.getenv('GITHUB_ACCESS_TOKEN'),
                    date_from = datetime.strptime('2020-01-01 00:00:00', '%Y-%m-%d %H:%M:%S'),
                    date_to = datetime.strptime('2020-01-02 00:00:00', '%Y-%m-%d %H:%M:%S'),
                    pushed_after = datetime.strptime('2020-06-07 00:00:00', '%Y-%m-%d %H:%M:%S'),
                    min_stars = 0, # (default = 0)
                    min_releases = 0, # (default = 0)
                    min_watchers = 0, # (default = 0)
                    min_issues = 0, # (default = 0)
                    primary_language = None, # e.g., 'python' (default = None)
                    include_fork = False)
                    

for repository in gh_miner.mine():
    print(repository)
    repo_miner = RepositoryMiner(access_token = os.getenv('GITHUB_ACCESS_TOKEN'),
                                 path_to_repo='path/to/cloned/repository',
                                 branch='development') # Optional (default 'master')

Query failed to run by returning code of 401. { search(query: "is:public stars:>=0 mirror:false archived:false created:2020-01-01T00:00:00Z..2020-01-02T00:00:00Z 
pushed:>=2020-06-07T00:00:00Z", type: REPOSITORY, first: 50 ) { repositoryCount pageInfo { endCursor startCursor 
hasNextPage } edges { node { ... on Repository { id defaultBranchRef { name } owner { login } name url description 
primaryLanguage { name } stargazers { totalCount } watchers { totalCount } releases { totalCount } issues { 
totalCount } createdAt pushedAt updatedAt hasIssuesEnabled isArchived isDisabled isMirror isFork object(expression: 
"master:") { ... on Tree { entries { name type } } } } } } } 

    rateLimit {
        limit
        cost
        remaining
        resetAt
    }
}


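In the loop above every repository is mined from the same placeholder path. One possible way to give each repository its own clone location is sketched below. It only plans the git clone command and the destination folder; it does not run git. The helper name plan_clone, the clones root folder, and the example URL are assumptions for this sketch, and how the clone URL is read from the objects yielded by gh_miner.mine() is not shown in this notebook.

```python
from pathlib import Path


def plan_clone(repo_url, clones_root="clones"):
    """Return (command, destination) for cloning `repo_url` under `clones_root`.

    The destination is <clones_root>/<owner>/<name>, derived from the URL.
    Hypothetical helper for illustration; nothing is executed here.
    """
    owner, name = repo_url.rstrip("/").split("/")[-2:]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    destination = Path(clones_root) / owner / name
    command = ["git", "clone", "--quiet", repo_url, str(destination)]
    return command, destination


# Example URL is made up for the demo.
command, destination = plan_clone("https://github.com/example-owner/example-repo.git")
print(command)
print(destination)
```

The command could then be executed with `subprocess.run(command, check=True)`, and `str(destination)` passed as path_to_repo to RepositoryMiner.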

## Load labeled files in datastore

First create a data folder to hold the labeled files. To make the labeled files available in the workspace, save them as a dataset: this creates a reference to the data source location, along with a copy of its metadata.

In [96]:
data_folder = os.path.join(os.getcwd(), 'data')
os.makedirs(data_folder, exist_ok=True)

# create a FileDataset pointing to files in 'data_folder' folder and its subfolders recursively
labeled_iac_scripts = Dataset.File.from_files(path=data_folder)

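The data folder has to contain the labeled files before a dataset pointing at it is useful. The format of the labeled files is still an open question in this notebook, so the sketch below simply assumes a list of records with filepath, commit, and label fields and writes them to a CSV file. The field names, the file name labeled_files.csv, and the helper save_labeled_files are all assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path

FIELDNAMES = ["filepath", "commit", "label"]  # assumed columns, not iacminer's format


def save_labeled_files(records, data_folder, filename="labeled_files.csv"):
    """Write `records` (dicts with the FIELDNAMES keys) to a CSV file in `data_folder`."""
    folder = Path(data_folder)
    folder.mkdir(parents=True, exist_ok=True)
    out_path = folder / filename
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(records)
    return out_path


# Made-up records for the demo.
records = [
    {"filepath": "roles/web/tasks/main.yml", "commit": "abc123", "label": "defect-prone"},
    {"filepath": "roles/db/tasks/main.yml", "commit": "abc123", "label": "defect-free"},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = save_labeled_files(records, Path(tmp) / "data")
    out_name = out_path.name
    with out_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

print(out_name, len(rows))
```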
Register the labeled dataset with the workspace. Use the register() method to register datasets with your workspace so that you can share them with others and reuse them across experiments. For now, this is done in the default workspace.

In [None]:
labeled_iac_scripts = labeled_iac_scripts.register(workspace=ws,
                                 name='labeled_iac_scripts',
                                 description='Labeled IaC scripts after mining the repositories',
                                 create_new_version = True)

In [None]:
#May display some values

## Prepare for training

Are the labeled_files CSV files? \
What information is stored in the labeled_files? \
What features should be used for training? \
Should the data be transformed? In what way? \
What will be the train and test split? \
How are the labels defined? Or how can they be defined?

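For the train and test split, one option to consider is a stratified split, which keeps the proportion of defect-prone and defect-free files roughly the same in both sets. The sketch below uses only the standard library; the 80/20 ratio, the fixed seed, the sample tuples, and the function name stratified_split are assumptions for illustration, not decisions made in this notebook.

```python
import random
from collections import defaultdict


def stratified_split(samples, label_of, test_fraction=0.2, seed=42):
    """Split `samples` into (train, test), sampling each label group separately.

    Groups with more than one sample always contribute at least one test sample.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[label_of(sample)].append(sample)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        if len(group) > 1:
            n_test = max(1, round(len(group) * test_fraction))
        else:
            n_test = 0  # a single sample stays in the training set
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Made-up samples: (file name, label)
samples = [(f"clean_{i}.yml", "defect-free") for i in range(8)]
samples += [(f"buggy_{i}.yml", "defect-prone") for i in range(2)]

train, test = stratified_split(samples, label_of=lambda s: s[1])
print(len(train), len(test))
```

Since the miner extracts metrics on a per-release basis, a split by release (train on older releases, test on newer ones) may be worth comparing against this random split.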
## Train a model

### First define the training settings

What ML technique will be used? It is also possible to make use of AutoML and compare the results afterwards. \
Or shall I try different models? \
In a future deployment it is possible to run models in parallel.

Create a directory\
Create a training script\
Create an estimator object\
Submit the job

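The first two steps, creating a directory and a training script, can be sketched as follows. The script written here is only a stub that parses a data folder argument; the folder name, the --data-folder argument, and the helper create_training_folder are assumptions for this example. The estimator and the job submission depend on the ML technique, which is still undecided above, so they are left out.

```python
import tempfile
import textwrap
from pathlib import Path

# Stub training script; the actual model fitting is intentionally left as a TODO.
TRAIN_SCRIPT = textwrap.dedent(
    """\
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--data-folder", type=str, default="data")
    args = parser.parse_args()

    print("Training data folder:", args.data_folder)
    # TODO: load the labeled files, fit a model, and save it under outputs/
    """
)


def create_training_folder(root, script_name="train.py"):
    """Create the folder `root` and write the stub training script into it."""
    folder = Path(root)
    folder.mkdir(parents=True, exist_ok=True)
    script_path = folder / script_name
    script_path.write_text(TRAIN_SCRIPT, encoding="utf-8")
    return script_path


# Demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    script_path = create_training_folder(Path(tmp) / "defect-prediction-training")
    script_text = script_path.read_text(encoding="utf-8")

print(script_path.name)
```

Saving the trained model under outputs/ in the training script matters because files written there are uploaded to the run, which is where the model registration step later looks for the .pkl file.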
## Training results

Follow the Link to Azure Machine Learning studio to see the experiment results of all individual runs. Navigate to the Outputs + logs tab to see the .pkl file for the model that was uploaded to the run during each training iteration.

In [17]:
experiment

Name,Workspace,Report Page,Docs Page
Defect-Prediction-test-experiment,Defect_Prediction,Link to Azure Machine Learning studio,Link to Documentation


In [None]:
from azureml.widgets import RunDetails

# "ML_model" is a placeholder: replace it with the run configuration to submit
run = experiment.submit("ML_model", show_output=True)
RunDetails(run).show()

In [None]:
run.wait_for_completion(show_output=False)  # specify True for a verbose log

In [None]:
print(run.get_metrics())

## Register model

In [None]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

In [None]:
print(run.get_metrics())
print(run.get_file_names())

In [None]:
# register model
model = run.register_model(model_name='MODEL_name',
                           model_path='outputs/model_name.pkl')
print(model.name, model.id, model.version, sep='\t')

## Deploy model

Use/download the pretrained model 

Create the scoring script, called score.py, which the web service call uses to show how to use the model. Once the pipelines are created, parallel runs can be made.

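An Azure Machine Learning scoring script conventionally exposes two functions: init(), called once when the service starts, and run(raw_data), called for every request. The skeleton below follows that shape but uses a stand-in model that labels every row defect-free, because no model has been trained yet in this notebook. The JSON layout with a "data" key and the "predictions" key in the response are assumptions for this sketch.

```python
import json

model = None  # set by init()


def init():
    """Load the model once at service start-up.

    Stand-in only: a real script would load the registered .pkl file here.
    """
    global model
    model = lambda rows: ["defect-free" for _ in rows]


def run(raw_data):
    """Score one request; `raw_data` is a JSON string such as '{"data": [[...], ...]}'."""
    try:
        rows = json.loads(raw_data)["data"]
        return {"predictions": model(rows)}
    except Exception as exc:  # report the error to the caller instead of crashing
        return {"error": str(exc)}


# Local smoke test of the two entry points.
init()
result = run(json.dumps({"data": [[1, 2, 3], [4, 5, 6]]}))
print(result)
```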
(Build the pipeline)

Create a deployment configuration file and specify the number of CPUs and gigabytes of RAM needed for your ACI container.

Deploy the model to ACI and build an HTTP POST request to the endpoint. 

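Building the HTTP POST request for the deployed endpoint could look like the sketch below, which uses only the standard library and does not send anything. The scoring URI is a placeholder (the real one comes from the deployed web service), and the {"data": [...]} payload layout is an assumption that has to match whatever score.py expects.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows):
    """Return a POST request carrying `rows` as a JSON body; the request is not sent."""
    body = json.dumps({"data": rows}).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URI and made-up feature values for the demo.
request = build_scoring_request("http://example.invalid/score", [[0.0, 1.0, 2.0]])
print(request.get_method(), request.full_url)
```

Sending it would be a matter of `urllib.request.urlopen(request)` once the service is running and the real scoring URI is filled in.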
### Remarks

In [106]:
datastores = ws.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)

workspacefilestore AzureFile
workspaceblobstore AzureBlob


In [102]:
datastore = ws.get_default_datastore()
datastore

{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-04d9ce01-ba41-41e9-a6ff-ecc58abfad5d",
  "account_name": "defectpredicti3405648290",
  "protocol": "https",
  "endpoint": "core.windows.net"
}