# Learning Objectives

- Provide a comprehensive understanding of MLOps as a conceptual subset of Development & Operations (DevOps) from software engineering.
- Illustrate how integration of version control using Git enables tracking key artifacts in MLOps pipelines.
- Introduce an MLOps project workflow that implements DevOps CI/CD practices.

# Introduction

In an endeavor to build systems that constantly learn from data, we need a mindset that fosters organizational processes for building model artifacts and gracefully deploying fresher versions of these models to production. One way of implementing these processes is to take inspiration from the broader world of software development, and specifically from two ideas from this world:
- Principles of DevOps to build and manage ML systems
- Continuous Integration and Continuous Deployment (CI/CD) to automate DevOps

## DevOps -> MLOps

![devops](figures/devops-cycle.png)

[Source](https://www.hiclipart.com/free-transparent-background-png-clipart-pytym)

The DevOps cycle encompasses a set of steps that facilitate the development, deployment, and operation of software systems. In the context of machine learning, these steps can be adapted to create an ML DevOps cycle, which focuses on managing and automating the machine learning workflow. Let us look at each of these steps in detail.

1. *Code*: This stage involves developing and maintaining the machine learning codebase. This includes writing code for data preprocessing, model training, evaluation, and deployment. 

2. *Build*: This stage involves packaging the code and its dependencies into a deployable format. This step ensures that the ML codebase can be easily reproduced and deployed in different environments. 

3. *Test*: This stage focuses on verifying the functionality, quality, and performance of the containerized model. This includes running unit tests (i.e., short-term monitoring checks), integration tests (to ensure the model doesn't break an existing system), and evaluating model performance on test datasets. 

4. *Release*: This stage involves preparing the containerized model for deployment in a production environment. This includes generating deployment artifacts (including a canary environment for release), documenting release notes, and ensuring that all necessary dependencies are included.

5. *Deploy*: In this stage, the containerized model is deployed to a production environment, making it available for serving predictions or integrating with other systems.

6. *Operate*: This stage focuses on monitoring and managing the deployed ML system. This includes logging relevant metrics, handling errors, and ensuring the system's health and availability. Note that at this stage the focus is on logging and measuring business metrics around the deployed ML model.

7. *Monitor*: This involves continuously monitoring the ML system's performance, data quality, and model behavior in a production environment using a series of long-term monitoring checks. This helps identify anomalies, detect drift, and ensure ongoing reliability.

8. *Plan*: This stage focuses on gathering feedback, analyzing performance, and incorporating improvements into future iterations of the ML system. It involves planning for enhancements, bug fixes, and updates.
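
The long-term monitoring checks mentioned in the *Monitor* step can be sketched in a few lines. The example below is a minimal illustration, not a production drift detector: the function names, the sample values, and the threshold of 0.5 standard deviations are all assumptions made for this sketch. It flags drift when the mean of a live feature moves too far from the training mean, measured in training standard deviations.

```python
from statistics import mean, pstdev


def drift_score(train_values, live_values):
    """Shift of the live mean from the training mean, in training std units."""
    mu = mean(train_values)
    sigma = pstdev(train_values)
    if sigma == 0:
        raise ValueError("training values have zero variance")
    return abs(mean(live_values) - mu) / sigma


def has_drifted(train_values, live_values, threshold=0.5):
    """Return True when the shift exceeds the (hypothetical) threshold."""
    return drift_score(train_values, live_values) > threshold


# Illustrative torque readings, not taken from the dataset used in this session
train_torque = [38.0, 40.0, 42.0, 39.0, 41.0]
live_ok = [39.5, 40.5, 40.0]
live_shifted = [47.0, 48.0, 49.0]

print(has_drifted(train_torque, live_ok))
print(has_drifted(train_torque, live_shifted))
```

In practice such a check would run on a schedule against logged production inputs, and a positive result would feed back into the *Plan* stage.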

## Continuous Integration/Continuous Deployment (CI/CD)

CI/CD (Continuous Integration/Continuous Deployment) is a crucial component of the DevOps cycle in the context of machine learning. It focuses on automating the process of integrating code changes, testing, and deploying ML models, ensuring a streamlined and efficient workflow. 


Continuous Integration (CI):
- CI involves automating the integration of code changes made by different developers into a shared repository. It aims to prevent integration conflicts and maintain code quality.
- In the ML context, CI ensures that changes to the ML codebase, including data preprocessing, model training, and evaluation, are automatically integrated into a central repository.
- CI systems, such as AWS CodeBuild, Azure DevOps, GitHub Actions or Jenkins, trigger automated builds and tests whenever changes are pushed to the repository.
- Continuous integration facilitates collaboration, identifies integration issues early, and promotes a consistent and stable codebase.

Continuous Deployment (CD):
- CD extends CI by automating the deployment process, allowing ML models and related components to be deployed to production environments in a reliable and reproducible manner.
- CD enables the automated execution of the machine learning pipeline, including model training, evaluation, packaging, and deployment.
- CD systems leverage CI artifacts and trigger deployment processes based on predefined criteria, such as passing tests or specific branch merges.
- CD ensures that ML models are consistently deployed to production, reducing manual effort and minimizing the risk of human error.
- It enables faster and more frequent deployments, enabling rapid iteration and quicker delivery of ML-based applications.
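
The "predefined criteria" that gate a deployment can be expressed as a simple predicate. The sketch below is illustrative only: the function name, the `main` branch convention, and the accuracy threshold of 0.95 are assumptions, and a real CD system would express the same rule in its own configuration language.

```python
def should_deploy(branch, tests_passed, validation_accuracy,
                  deploy_branch="main", min_accuracy=0.95):
    """Deploy only from the release branch, with green tests and a good enough model."""
    return (
        branch == deploy_branch
        and tests_passed
        and validation_accuracy >= min_accuracy
    )


# A merge to main with passing tests and a strong model is deployed
print(should_deploy("main", True, 0.986))
# A feature branch is never deployed, however good the model
print(should_deploy("feature-branch", True, 0.99))
# Failing tests block the deployment
print(should_deploy("main", False, 0.99))
```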

By leveraging CI/CD practices, ML teams can achieve greater efficiency, collaboration, and reliability in developing and deploying machine learning models. Let us now see the key steps to be followed to execute a CI/CD process for MLOps.

## Key steps in CI/CD for MLOps

An operational CI/CD MLOps system hinges on two key steps:

**Step 1 - Pipeline Delineation** 

A machine learning pipeline defines the sequence of steps required to train, evaluate, and deploy machine learning models. It encompasses various stages, such as data preprocessing, feature engineering, model training, validation, testing, and deployment. The pipeline ensures consistency and reproducibility in the machine learning workflow.

The pipeline can be represented as a series of interconnected components or modules, each responsible for a specific task. The pipeline delineation specifies the order of execution and the dependencies between the components.
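
One way to make "order of execution and dependencies" concrete is to describe the pipeline as a dependency graph and let a topological sort derive the execution order. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the step names are hypothetical and chosen only to mirror the stages listed above.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start
PIPELINE = {
    "preprocess": set(),
    "feature_engineering": {"preprocess"},
    "train": {"feature_engineering"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def execution_order(pipeline):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


print(execution_order(PIPELINE))
```

Declaring dependencies rather than a fixed list of steps means a new component only needs to state what it depends on; the order follows from the graph.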

**Step 2 - Version Control System for Tracking Pipeline Changes**

In an MLOps workflow, it is crucial to track changes made to the machine learning pipeline, including modifications to code, configurations, and dependencies. A version control system, such as Git, serves as the foundation for tracking and managing these changes. With a version control system, each change made to the pipeline is recorded as a commit, capturing the specific modifications made at a given point in time.

The utility of version control goes beyond tracking changes, though. When a change to the pipeline is pushed to the repository, the hosting platform or an attached CI/CD service can be configured to trigger the execution of the pipeline automatically. This ensures that any modifications to the pipeline initiate the necessary steps for retraining, reevaluation, or redeployment. For example, if a developer makes changes to the preprocessing component of the pipeline, such as adding new data transformations or modifying existing ones, the version control system registers the changes. As a result, the pipeline execution is triggered, rerunning the preprocessing step with the updated logic. Similarly, if a change is made to the model training component, such as using a different algorithm or adjusting hyperparameters, the version control system captures the modifications and the retraining process is initiated.

By coupling the version control system with automatic pipeline execution, the MLOps system ensures that any changes to the pipeline are automatically incorporated into the workflow, reducing the manual effort required for execution and achieving a CI/CD workflow.
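
The logic of "a change to one component reruns that component and everything after it" can be sketched as follows. The stage names and file paths are hypothetical; in a real setup the list of changed files would come from the version control system (for example, the files touched by a commit).

```python
# Ordered stages of a (hypothetical) pipeline
STAGES = ["preprocess", "train", "evaluate", "deploy"]

# Which stage each source file belongs to (illustrative paths)
STAGE_OF_FILE = {
    "src/preprocess.py": "preprocess",
    "src/train.py": "train",
    "src/evaluate.py": "evaluate",
}


def stages_to_rerun(changed_files):
    """Return the earliest affected stage and all stages downstream of it."""
    touched = [STAGE_OF_FILE[f] for f in changed_files if f in STAGE_OF_FILE]
    if not touched:
        return []  # e.g. only documentation changed
    first = min(STAGES.index(stage) for stage in touched)
    return STAGES[first:]


print(stages_to_rerun(["src/train.py"]))
print(stages_to_rerun(["README.md"]))
```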

# Business Use Case

<div class="alert alert-block alert-success">
 
For this session consider the case of predicting machinery failure based on the quality of the machinery and its wear and tear. Such predictions often require an expert intervention on the shopfloor and result in high dependencies on the expert.

The dataset used in this session is hosted on [OpenML](https://www.openml.org/search?type=data&status=active&id=42890).

</div>

# ML Pipelines

## Concept

Pipelines are a series of steps that are connected with each other; each step accomplishes a specific machine learning task.

For example, a typical machine learning pipeline might include the following steps:
- Data ingestion and preprocessing: Cleansing, transformation, and feature engineering on raw data.
- Model training: Training machine learning models using a specific algorithm and hyperparameters.
- Model evaluation: Assessing the model's performance using appropriate metrics and validation techniques.
- Model deployment: Deploying the trained model for inference or integration into a production environment.

## Implementation

### Setup

Imports

In [1]:
import sklearn
import joblib

from scipy.stats import randint

from sklearn.datasets import fetch_openml
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_validate, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

In [2]:
sklearn.set_config(display='diagram')

In [3]:
print(sklearn.__version__)

1.2.0


### Data

In [4]:
dataset = fetch_openml(data_id=42890, as_frame=True, parser='auto')

The `parser='auto'` argument was added in later versions of scikit-learn. Try removing the argument if the above cell returns an error.

In [5]:
print(dataset.DESCR)

The AI4I 2020 Predictive Maintenance Dataset is a synthetic dataset that reflects real predictive maintenance data encountered in industry. Since real predictive maintenance datasets are generally difficult to obtain and in particular difficult to publish, we present and provide a synthetic dataset that reflects real predictive maintenance encountered in industry to the best of our knowledge.

### Attribute Information:

The dataset consists of 10 000 data points stored as rows with 14 features in columns

- UID: unique identifier ranging from 1 to 10000
- product ID: consisting of a letter L, M, or H for low (50% of all products), medium (30%) and high (20%) as product quality variants and a variant-specific serial number
- air temperature [K]: generated using a random walk process later normalized to a standard deviation of 2 K around 300 K
- process temperature [K]: generated using a random walk process normalized to a standard deviation of 1 K, added to the air temperature plus 10 

In [6]:
maintenance_data = dataset.data

In [7]:
maintenance_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
 9   TWF                      10000 non-null  int64  
 10  HDF                      10000 non-null  int64  
 11  PWF                      10000 non-null  int64  
 12  OSF                      10000 non-null  int64  
 13  RNF                      10000 non-null  int64  
dtypes: float64(3), int64(9)

In [8]:
maintenance_data.head()

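| | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | TWF | HDF | PWF | OSF | RNF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |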


In [9]:
target = 'Machine failure'
numeric_features = [
    'Air temperature [K]', 
    'Process temperature [K]', 
    'Rotational speed [rpm]',
    'Torque [Nm]',
    'Tool wear [min]'
]
categorical_features = ['Type']

In [10]:
maintenance_data.loc[:, numeric_features].describe()

| | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] |
|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 300.00493 | 310.00556 | 1538.7761 | 39.98691 | 107.951 |
| std | 2.000259 | 1.483734 | 179.284096 | 9.968934 | 63.654147 |
| min | 295.3 | 305.7 | 1168.0 | 3.8 | 0.0 |
| 25% | 298.3 | 308.8 | 1423.0 | 33.2 | 53.0 |
| 50% | 300.1 | 310.1 | 1503.0 | 40.1 | 108.0 |
| 75% | 301.5 | 311.1 | 1612.0 | 46.8 | 162.0 |
| max | 304.5 | 313.8 | 2886.0 | 76.6 | 253.0 |


In [11]:
maintenance_data.loc[:, categorical_features].describe()

| | Type |
|---|---|
| count | 10000 |
| unique | 3 |
| top | L |
| freq | 6000 |


### Model Pipeline

In [12]:
maintenance_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
 9   TWF                      10000 non-null  int64  
 10  HDF                      10000 non-null  int64  
 11  PWF                      10000 non-null  int64  
 12  OSF                      10000 non-null  int64  
 13  RNF                      10000 non-null  int64  
dtypes: float64(3), int64(9)

In [13]:
X = maintenance_data.drop(
    columns=[target, 'UDI', 'Product ID', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF']
)
y = maintenance_data[target]

In [14]:
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

In [15]:
Xtrain.to_csv('data/20230921_training_features.csv', index=False)
ytrain.to_csv('data/20230921_training_target.csv', index=False)

In [16]:
Xtest.to_csv('data/20230921_test_features.csv', index=False)
ytest.to_csv('data/20230921_test_target.csv', index=False)

**When planning for deployment, a key difference to account for is that version control must extend to data as well.**

Versioning data is an equally important component of building models with deployment as an end goal. Often, when debugging models post deployment, we need to understand which specific version of the dataset was used to build the model. Deviation between training data and live (production) data is often a key reason behind performance degradation post deployment.
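
A lightweight way to record which version of a dataset a model was built from is to store a content hash of each data file alongside the model. The sketch below is one possible approach under stated assumptions: the helper names and the manifest file name are hypothetical, and dedicated data-versioning tools offer much more than this. It shows that any change to the file contents produces a different fingerprint.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_paths, manifest_path):
    """Write a JSON manifest mapping each file name to its content hash."""
    manifest = {Path(p).name: file_sha256(p) for p in data_paths}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest


# Demonstration on a throwaway file (contents are illustrative)
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "training_features.csv"
    csv_path.write_text("Type,Torque [Nm]\nM,42.8\n")
    manifest = write_manifest([csv_path], Path(tmp) / "data_manifest.json")
    fingerprint_before = manifest["training_features.csv"]

    # Changing a single value changes the fingerprint
    csv_path.write_text("Type,Torque [Nm]\nM,42.9\n")
    fingerprint_after = file_sha256(csv_path)

print(fingerprint_before != fingerprint_after)
```

Committing such a manifest to the repository ties each model version to the exact data files it was trained on, even when the data files themselves are too large to commit.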

Remember that the model we build is eventually going to be deployed for production. In this context, it is important to package all the preprocessing that you expect to do together with the model itself. `scikit-learn` has several production-ready capabilities to help with this workflow. A common production-aware training workflow is to use a [model building pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). When we build a model pipeline, we list all the steps that need to be performed, that is, preprocessing and model estimation. Think of a model pipeline as an interface that accepts inputs exactly as a customer would enter them and makes a prediction, accounting for every intermediate step that needs to be performed.

#### Handling preprocessing

The [`make_column_transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html) is a great way to specify the preprocessing steps (e.g., scaling, missing value imputation) to be conducted for specific features. 

The `make_column_transformer` takes tuples of preprocessing steps and the columns on which each step needs to be applied. It then applies these transformations to the specified features, storing the required parameters from the training data and applying them to the validation/test data. In sum, this transformer automates the `fit` and `transform` methods during the training and prediction steps, simplifying the modeling process.

Let us look at a column transformation preprocessor in action.

In [17]:
preprocessor = make_column_transformer(
    (StandardScaler(), numeric_features),
    (OneHotEncoder(handle_unknown='ignore'), categorical_features)
)

**Handling numeric features**

Common transformations on numeric features include scaling and missing value imputation. We create column transformers by arranging these transformations in the sequence we want them to be applied. In handling these transformations, we should be careful that the statistics they rely on (e.g., the mean) are computed from the training data only and then reused to transform the test data; statistics computed from the test data must never leak into the transformation. Column transformers provide a way to handle this in a scalable, yet transparent manner.
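
The fit-on-train, transform-on-test discipline can be illustrated without any library. The class below is a simplified stand-in written for this sketch, not scikit-learn's implementation; it borrows the attribute names `mean_` and `scale_` from `StandardScaler` and, like it, uses the population standard deviation. The sample values are illustrative.

```python
from statistics import mean, pstdev


class SimpleScaler:
    """Toy standard scaler: learns statistics in fit, reuses them in transform."""

    def fit(self, values):
        self.mean_ = mean(values)
        # Fall back to 1.0 for constant features to avoid division by zero
        self.scale_ = pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]


train_speed = [1400.0, 1500.0, 1600.0]
test_speed = [1500.0, 1700.0]

# Statistics come from the training data only
scaler = SimpleScaler().fit(train_speed)

# The test data is transformed with the training statistics, never refit
scaled_test = scaler.transform(test_speed)
print(scaler.mean_, scaled_test)
```

Calling `fit` again on the test data would silently change `mean_` and `scale_`, which is exactly the leakage the paragraph above warns about.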

**Handling categorical features**

By default, the column transformer drops any feature that is not explicitly assigned to a preprocessing step, so only the listed features reach the model. Categorical features among them should be handled appropriately using a [one-hot encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

This is important because in a production setting, the model might be exposed to categories it has not seen in the training phase. We will need a robust method to return meaningful predictions even in this case.

**Handling both numeric & categorical features** 

The preprocessor in the code listing above has two steps depending on the type of the column. If the feature is numeric it passes through the scaler, and if it is categorical it passes through the one-hot encoder. The argument `handle_unknown='ignore'` ensures that if a level not seen during training is encountered, it is encoded as zeros in all the one-hot columns of that feature.
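
The effect of ignoring unknown levels can be shown with a plain-Python sketch. The class below is a simplified stand-in written for illustration, not scikit-learn's `OneHotEncoder`; the level `"X"` is a made-up category that does not occur in the training data.

```python
class SimpleOneHot:
    """Toy one-hot encoder that encodes unseen levels as all zeros."""

    def fit(self, values):
        # Learn the levels from the training data, in sorted order
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # An unseen level matches no learned category, so its row is all zeros
        return [[1 if v == c else 0 for c in self.categories_] for v in values]


encoder = SimpleOneHot().fit(["M", "L", "L", "H"])
encoded = encoder.transform(["L", "X"])

print(encoder.categories_)  # ['H', 'L', 'M']
print(encoded)              # [[0, 1, 0], [0, 0, 0]]
```

The all-zero row lets the downstream model still return a prediction for an unexpected input instead of raising an error at serving time.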

Once we have the preprocessing pipeline in place, we then create a *model pipeline* that incorporates preprocessing as a step in the model itself. This is a crucial step. Once the model is deployed, customers will pass input variables to it in the original scale of the variables. If any preprocessing is needed for the model to make a prediction, it needs to be packaged together with the model.

**Assembling the pipeline**

Let us now build a model pipeline assembling all the steps from the input data to the model we wish to estimate.

In [18]:
model_pipeline = make_pipeline(
    preprocessor, 
    GradientBoostingClassifier()
)

In [19]:
model_pipeline

In [20]:
model_pipeline.fit(Xtrain, ytrain)

Notice how the model pipeline elegantly handles scaling numeric predictors and one-hot encoding categorical features. These transformed features are passed on to a `GradientBoostingClassifier` for estimation.

Since a model pipeline stores each stage as a named step, we can dig deeper into the pipeline by accessing these steps and inspecting what each one does.

In [21]:
model_pipeline.named_steps

{'columntransformer': ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                  ['Air temperature [K]',
                                   'Process temperature [K]',
                                   'Rotational speed [rpm]', 'Torque [Nm]',
                                   'Tool wear [min]']),
                                 ('onehotencoder',
                                  OneHotEncoder(handle_unknown='ignore'),
                                  ['Type'])]),
 'gradientboostingclassifier': GradientBoostingClassifier()}

In [22]:
model_pipeline.named_steps['columntransformer'].transformers_

[('standardscaler',
  StandardScaler(),
  ['Air temperature [K]',
   'Process temperature [K]',
   'Rotational speed [rpm]',
   'Torque [Nm]',
   'Tool wear [min]']),
 ('onehotencoder', OneHotEncoder(handle_unknown='ignore'), ['Type'])]

In [23]:
model_pipeline.named_steps['columntransformer'].transformers_[1][1].get_feature_names_out()

array(['Type_H', 'Type_L', 'Type_M'], dtype=object)

As we can see from the output above, the features are being transformed through a sequence of `named_steps` with all the heavy lifting done by the `ColumnTransformer`.

#### Hyperparameter tuning

Baseline

In [24]:
cross_val_accuracy = cross_validate(
    model_pipeline, Xtrain, ytrain, cv=3, return_train_score=True
)

In [25]:
print(f"Training accuracy: {cross_val_accuracy['train_score'].mean()}")
print(f"Validation accuracy: {cross_val_accuracy['test_score'].mean()}")

Training accuracy: 0.9938125233876987
Validation accuracy: 0.9834998588417259


Tuning

In [26]:
model_gbr = GradientBoostingClassifier(
    max_depth=3, 
    n_estimators=100, 
    random_state=42
)

In [27]:
model_pipeline = make_pipeline(
    preprocessor, 
    model_gbr
)

In [28]:
model_pipeline.named_steps

{'columntransformer': ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                  ['Air temperature [K]',
                                   'Process temperature [K]',
                                   'Rotational speed [rpm]', 'Torque [Nm]',
                                   'Tool wear [min]']),
                                 ('onehotencoder',
                                  OneHotEncoder(handle_unknown='ignore'),
                                  ['Type'])]),
 'gradientboostingclassifier': GradientBoostingClassifier(random_state=42)}

In [29]:
param_distrib = {
    "gradientboostingclassifier__max_depth": randint(3, 12),
    "gradientboostingclassifier__n_estimators": randint(100, 1000)
}

In [30]:
rand_search_cv = RandomizedSearchCV(
    model_pipeline,
    param_distrib,
    n_iter=10,
    cv=3,
    random_state=42
)

In [31]:
rand_search_cv.fit(Xtrain, ytrain)

In [32]:
rand_search_cv.best_estimator_

In [33]:
rand_search_cv.best_score_

0.9861247182811826

Now we have a trained model in hand. But at this stage it is still a Python object that lives only in the memory of the Python session in which it was trained (i.e., the local compute). To be able to receive inputs from customers, this Python object needs to be stored in a file format that can be readily copied to, and loaded in, another environment, independent of the original training session.

In [34]:
saved_model_path = "models/model-v1.joblib"

In [35]:
joblib.dump(rand_search_cv.best_estimator_, saved_model_path)

['models/model-v1.joblib']

Crucial components of a model are often referred to as *artefacts*. In this case, the model artefact is the estimated model pipeline.

The model is now saved in a binary object (`.joblib`) and the process of converting the model (Python object) to a binary format is referred to as *serialization*.
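
The serialize-then-restore round trip can be demonstrated with the standard-library `pickle` module, on which `joblib` persistence builds. The sketch below is illustrative: the dictionary of "fitted parameters" and the file name are made up for this example and stand in for a real estimator.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical fitted parameters standing in for a trained model object
fitted_params = {
    "mean_": 1500.0,
    "scale_": 81.6,
    "categories_": ["H", "L", "M"],
}

with tempfile.TemporaryDirectory() as tmp:
    artifact_path = Path(tmp) / "model-v1.pkl"

    # Serialization: Python object -> bytes on disk
    artifact_path.write_bytes(pickle.dumps(fitted_params))

    # Deserialization: bytes on disk -> an equivalent Python object
    restored_params = pickle.loads(artifact_path.read_bytes())

print(restored_params == fitted_params)
```

Two practical caveats apply to pickle-based formats: the loading environment needs compatible versions of Python and of the libraries that define the object, and files from untrusted sources should never be loaded, since deserialization can execute arbitrary code.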

In [36]:
saved_model = joblib.load('models/model-v1.joblib')

In [37]:
saved_model

# Version Control with `git`

## Introduction

Version control is essential in an MLOps workflow because it facilitates collaboration between different members of a team who might be working on different parts of the same project. Each such part can be developed as a branch and merged into the main codebase when the development on that branch is complete.

While `git` is one such open source version control system that can be used to track local code changes, managed code repository services like `GitHub` enable remote hosting and allow contributions between team members to be updated seamlessly. In complex ML pipelines, issues and bugs are inevitable. Version control systems offer the capability to revert to previous working states, providing a safety net for rollbacks and bug fixes. Git allows teams to roll back to a specific commit or create a new branch to fix issues while preserving the integrity of the codebase. It ensures that previous working versions can be easily restored, preventing potential disruptions in the pipeline.

In the context of executing CI/CD DevOps pipelines, we now illustrate two important workflows while working with `git` using a public GitHub repository.

## The `clone` - `commit` - `push` workflow

![git-workflow](figures/git-workflow.drawio.png)

The clone-commit-push workflow is a common Git workflow for working with remote repositories. It involves cloning a repository, making changes, committing those changes, and pushing them back to the remote repository. Here's a step-by-step explanation with examples:

1. Clone the Repository:
   - To clone a remote repository, use the `git clone` command followed by the repository's URL.
   - Example:
     ```
     git clone https://github.com/example-user/example-repo.git
     ```
   - This creates a local copy of the remote repository on your machine.

2. Make Changes:
   - Change into the cloned repository's directory:
     ```
     cd example-repo
     ```
   - Make the necessary changes to the files in the repository using any text editor or IDE.

3. Commit Changes:
   - Stage the changes you want to commit using the `git add` command.
   - Example:
     ```
     git add modified_file.py
     ```
   - Commit the staged changes with a meaningful commit message using the `git commit` command.
   - Example:
     ```
     git commit -m "Update modified_file.py with new feature"
     ```

4. Push Changes:
   - Push the committed changes to the remote repository using the `git push` command.
   - Example:
     ```
     git push origin master
     ```
   - This command pushes the committed changes to the `master` branch of the remote repository named `origin`.

Note: `origin` is the default name Git gives to the remote repository that was cloned. You can replace it with the appropriate name if your remote repository has a different name. New repositories on GitHub use `main` instead of `master` as the name of the default branch, hence we say `git push origin main` if the remote repository is hosted on GitHub and follows this default.

5. Pull Changes (Optional):
   - If you are working in a team or collaborating with others, it's a good practice to pull the latest changes from the remote repository before making your own changes.
   - Use the `git pull` command to fetch and merge the latest changes from the remote repository.
   - Example:
     ```
     git pull origin master
     ```
   - This ensures that your local repository is up to date with the remote repository before you make your own modifications.

The clone-commit-push workflow allows you to work on your local repository, make changes, commit them with informative messages, and push them back to the remote repository. It facilitates collaboration, version control, and the seamless integration of changes into the project.

## Branching

![git-branches](figures/git-branch.drawio.png)

In a collaborative development environment where multiple developers are working on different aspects of a proposed change, Git branching is a valuable feature that enables an organized and efficient workflow. Each developer can work on a separate branch, allowing them to make independent changes without interfering with each other's work. Here's an explanation of the usage of Git branching in this context:

1. Creating Branches:
   - Each developer starts by creating their own branch, typically based on the main branch (e.g., "master" or "main").
   - Developers can use the `git branch` command to create a new branch or the `git checkout -b` command to create and switch to a new branch in one step.
   - Example:
     ```
     git branch feature-branch
     ```
     or
     ```
     git checkout -b feature-branch
     ```
     If we want to switch to an existing branch, we use `git checkout existing-feature-branch`

2. Working on Branches:
   - Each developer now works on their respective branch, focusing on their specific tasks or changes.
   - They can make changes, add new features, fix bugs, or modify code without affecting the main branch or other developers' work.
   - Developers commit their changes locally to their branch using the standard `git add` and `git commit` commands.

3. Sharing Branches:
   - Developers can push their local branches to a shared remote repository to collaborate with others.
   - They use the `git push` command with the branch name and the remote repository to push their branch.
   - Example:
     ```
     git push origin feature-branch
     ```

4. Reviewing and Merging Changes:
   - Once a developer has completed their work on the branch, they can create a pull request or merge request, depending on the Git hosting platform being used (e.g., GitHub, GitLab, Bitbucket).
   - The pull request allows other developers to review the changes, provide feedback, and discuss the proposed changes before merging them into the main branch.
   - After the review and approval process, the changes from the branch can be merged into the main branch using the platform's interface.

5. Updating Branches:
   - During the development process, other developers may make changes to the main branch.
   - To incorporate those changes into their branch, developers can perform a branch update by switching to their branch and using the `git merge` command or `git rebase` command to integrate the latest changes from the main branch.
   - Example:
     ```
     git checkout feature-branch
     git merge main
     ```

By utilizing Git branching, developers can work on different aspects of a proposed change simultaneously, without interfering with each other's work. It allows for parallel development, easy collaboration, and the ability to review, discuss, and merge changes in an organized manner.

# Automating the development workflow

# Code -> Build

## Code repositories

There are two important benefits in attaching a code folder (implementing a machine learning pipeline, for example) to a version control system like `git` and hosting it on a remote location such as GitHub or AWS CodeCommit.

1. Ownership: With version control, each commit is associated with an author, allowing transparency in changes. This enables teams of data scientists to have a transparent record of contributions and allows for better collaboration and accountability. For example, if a data scientist encounters an issue with a specific piece of code, they can easily identify the author of that code and seek their assistance in resolving the problem.

2. Tracking Changes: Version control systems provide a comprehensive history of code changes. Every commit is recorded, along with the changes made, enabling easy tracking of modifications over time. This allows teams to understand the evolution of the codebase, identify when and why certain changes were introduced, and quickly revert to previous versions if necessary. For instance, if a bug is discovered in the latest version of the code, the team can examine the commit history to identify the specific change that introduced the bug and revert it to a previous working version.

## Triggers

While tracking and managing the versions of code is one part of the CI/CD process, another key component is the ability to automatically trigger the execution of code once the data science team lands a `push` or `merge` to the code repository. These triggers initiate the creation of a `build` from a folder of code. In this context, a `build` refers to a tested, version-controlled artifact that represents the outcome of the code.

### GitHub Actions

GitHub Actions is a powerful automation platform provided by GitHub that allows developers to create custom workflows to automate various tasks in their software development processes. At its core, GitHub Actions revolves around the concept of workflows. A workflow is a configurable set of steps that can be triggered by specific events, such as a push to the repository, a pull request, or a scheduled time. Each step in a workflow represents a specific action, which can be a command-line script, a shell command, or an external action defined by the community.

Workflows are defined in YAML files that reside within the repository (GitHub looks for them in the `.github/workflows` directory). YAML (YAML Ain't Markup Language) is a data serialization format designed to be easy for humans to read and write and for machines to parse. In this case we use YAML to specify the configuration of the trigger, which is stored in a `.yml` (or `.yaml`) file. These files describe the series of actions to be executed, their dependencies, and the conditions under which they should run. This declarative approach makes workflows easy to configure and to keep under version control. YAML files have applications beyond the specification of workflows (we will see them again as a neat method to create Azure ML pipelines, as an alternative to decorators).

When a triggering event occurs, GitHub Actions spins up a virtual environment, called a runner, to execute the defined workflow. Runners can be hosted by GitHub or self-hosted on your own infrastructure. They provide the necessary execution environment, including the required operating system, programming languages, and tools.
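
As a mental model only, the flow described above (an event arrives, a matching workflow is selected, and its steps run in order on a runner) can be sketched in Python. This is not how GitHub implements Actions. The `Workflow` class, the `dispatch` function, and the step callables are hypothetical names chosen for illustration, and branch matching is reduced to exact string comparison.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Workflow:
    """A simplified stand-in for a workflow file: triggers plus ordered steps."""
    name: str
    # Maps an event name (e.g. "push") to the branches that should trigger it.
    triggers: Dict[str, List[str]]
    # Each step is a (label, callable) pair; the callable returns True on success.
    steps: List[Tuple[str, Callable[[], bool]]] = field(default_factory=list)

    def matches(self, event: str, branch: str) -> bool:
        return branch in self.triggers.get(event, [])


def dispatch(workflow: Workflow, event: str, branch: str) -> List[str]:
    """Run the workflow's steps in order if the event matches; stop on failure."""
    log: List[str] = []
    if not workflow.matches(event, branch):
        log.append(f"skipped: {event} on {branch}")
        return log
    for label, step in workflow.steps:
        ok = step()
        log.append(f"{label}: {'ok' if ok else 'failed'}")
        if not ok:
            break  # In this sketch, later steps do not run once a step fails.
    return log


workflow = Workflow(
    name="run-training-file-on-commit",
    triggers={"push": ["main"], "pull_request": ["main"]},
    steps=[
        ("checkout", lambda: True),
        ("install dependencies", lambda: True),
        ("train", lambda: True),
    ],
)

print(dispatch(workflow, "push", "main"))
print(dispatch(workflow, "push", "experiment"))
```

The first call runs all three steps, while the second is skipped because the branch does not appear in the workflow's triggers.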

Before we move on to specify GitHub Actions using YAML, here is a summary of key YAML rules:

1. Indentation: YAML relies on indentation for structure, so proper indentation is crucial. Use spaces (not tabs) for indentation, and maintain consistent indentation levels throughout the file. Typically, two spaces or four spaces are used for each level of indentation.

Example:
```yaml
key:
  subkey1: value1
  subkey2: value2
```

2. Key-Value Pairs: YAML uses a key-value pair structure. Each key is followed by a colon (:) and a space, and a simple value is written on the same line. A nested block of key-value pairs or a list goes on the following lines, indented one level deeper. Avoid using tabs or special characters in keys.

Example:
```yaml
name: John Doe
age: 30
```

3. Lists and Arrays: YAML supports lists and arrays. To define a list, use a hyphen (-) followed by a space before each list item. All items in the list should be indented at the same level.

Example:
```yaml
fruits:
  - Apple
  - Banana
  - Orange
```

4. Comments: YAML allows comments to provide additional context or explanations. Comments start with the hash symbol (#) and can appear on their own line or at the end of a line. Comments are ignored by the YAML parser.

Example:
```yaml
# This is a comment
key: value  # Inline comment
```

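5. Quoting Strings: YAML generally does not require quotes around strings, even when they contain spaces. Enclose a string in single quotes (' ') or double quotes (" ") when it contains characters that YAML treats specially (for example, a colon followed by a space, or a hash preceded by a space), or when it would otherwise be read as another type. For instance, an unquoted version such as 3.10 is read as the number 3.1.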

Example:
```yaml
message: 'Hello, World!'
description: "This is a YAML file."
```
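
To make these rules concrete, the sketch below writes out the Python objects that a YAML parser would typically produce for the examples above: mappings become dictionaries, lists become lists, and comments are dropped. It uses only the standard library, so the structures are written by hand rather than parsed, and the variable names are illustrative. The last lines imitate, using `float`, what happens when a version-like value is left unquoted.

```python
# Nested mapping (rule 1): indentation becomes nesting of dictionaries.
nested = {"key": {"subkey1": "value1", "subkey2": "value2"}}

# Key-value pairs (rule 2): unquoted 30 is read as an integer, not a string.
person = {"name": "John Doe", "age": 30}

# List (rule 3): hyphenated items become a Python list.
fruits = {"fruits": ["Apple", "Banana", "Orange"]}

# Comments (rule 4) are discarded, so only the key-value pair remains.
commented = {"key": "value"}

# Quoting (rule 5): an unquoted value such as 3.10 is resolved as a float,
# which silently drops the trailing zero; quoting keeps it as text.
unquoted_version = float("3.10")
quoted_version = "3.10"

print(nested["key"]["subkey2"])
print(type(person["age"]).__name__)
print(unquoted_version, quoted_version)
```

The float behaviour is why version numbers in configuration files are usually safer when quoted.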

Now that the rules of YAML are in place, below is a workflow whose trigger fires on every push to, or pull request against, the `main` branch and runs a specified file from the code repository. This run, that is, the execution of a Python script that performs a specific action, makes up the `build` step.

```yaml
name: run-training-file-on-commit
on:
  push:
    branches:
    - main
  pull_request:
    branches:
    - main

jobs:
  build:
    runs-on: ubuntu-latest

    permissions:
      # Give the default GITHUB_TOKEN write permission to commit and push the
      # added or changed files to the repository.
      contents: write

    steps:
      - name: Checkout repository content
        uses: actions/checkout@v4 # Check out the repository content onto the GitHub runner.

      - name: Setup Python version
        uses: actions/setup-python@v4
        with:
          python-version: '3.9' # Quoted so YAML keeps the version as a string, not a float

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -r requirements.txt

      - name: Execute training script # Run the specified file from the repo
        run: python train.py

      - name: Persist artifacts to the repo
        uses: stefanzweifel/git-auto-commit-action@v4 # Commit and push files changed during the run
```

GitHub Actions workflows are typically used to run tests and to generate a build (here, a persisted model object) from a model training workflow.
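
For illustration, a training script of the kind the workflow runs might look like the sketch below. The actual `train.py` is not shown here, so everything in the sketch is an assumption: the toy data, the least-squares line fit, the `train` and `save_model` helpers, and the `model.pkl` output file name. It uses only the standard library so that it runs without extra dependencies.

```python
import pickle
from pathlib import Path

# Toy training data (hypothetical): y is roughly 2 * x + 1.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.0, 3.1, 4.9, 7.0, 9.0]


def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return {"slope": slope, "intercept": intercept}


def save_model(model, path="model.pkl"):
    """Persist the fitted model so a later workflow step can commit or upload it."""
    path = Path(path)
    with path.open("wb") as handle:
        pickle.dump(model, handle)
    return path


if __name__ == "__main__":
    model = train(X, Y)
    artifact = save_model(model)
    print(f"Saved {artifact} with parameters {model}")
```

When the workflow executes `python train.py`, a script like this would leave a new or updated `model.pkl` in the working directory, which the final step of the workflow could then commit back to the repository as the build.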