# 04 Continuous X

> “The most powerful tool we have as developers is automation.” ~ Scott Hanselman

![continuous_x](https://miro.medium.com/max/4800/1*WsbCzcT3HWRJ4SspEl-qdg.jpeg)

**Source:** Found on a [Medium](https://devopsdiaries.williamtsoi.net/continuous-delivery-in-cartoons-d67bbd6b6954) post where the author credits Nhan Ngo for the creation of this image.

## Table of Contents

1. [Overview](#1.-Overview)
2. [Learning Outcomes](#2.-Learning-Outcomes)
3. [Tools](#3.-Tools)
4. [Version Control](#4.-Version-Control)
    - Git
    - Data Version Control (dvc)
5. [CI/CD Pipelines](#5.-CI/CD-Pipelines)
    - GitHub Actions
    - Continuous Machine Learning (CML)
7. [Summary](#7.-Summary)

## 1. Overview

![automation](https://imgs.xkcd.com/comics/automation_2x.png)


At the heart of DevOps lies automation. While the word rolls off the tongue quite easily, accomplishing it in the software engineering world can be quite challenging. To manage the complexity, engineers have developed a system called Continuous Delivery (CD). At its core, 

## 2. Learning Outcomes

Before we get started, let's go over the learning outcomes for this section of the workshop.

By the end of this lesson you will be able to:
1. Discuss what ETL and ELT Pipelines are.
2. Understand how to read and combine data that comes from different sources.
3. Create data pipelines using pandas and prefect.
4. Understand how to visualize the pipelines you create to help you with their development.

## 3. Tools

![cartoon]()



## 4. Version Control

### 4.1 Git

In [1]:
!mkdir -p ~/our_pyconza_proj/src

In [2]:
!ls ~/our_pyconza_proj

anaconda			 Documents	      Pictures	 TresoritDrive
aura-theme.dconf		 Downloads	      Postman	 Tresors
balenaEtcher-1.7.9-x64.AppImage  Music		      Public	 Videos
bank_statement_sept2022.pdf	 new_file.pdf	      R
cba_bank_statement_sept2022.pdf  nodesource_setup.sh  snap
Desktop				 our_pyconza_proj     Templates


Let's go over the next steps together.

1. Open a new terminal in your editor of choice.
2. Activate the same environment you have been using for this session with `conda activate our_environment`.
3. Navigate to the new folder with `cd ~/our_pyconza_proj`.
4. Copy our src files into the project's `src` folder with `cp pycon_za22_data_devops/src/*.py ~/our_pyconza_proj/src/` (adjust the source path to wherever the workshop repository lives on your machine).
5. Initialize a new git repository with `git init`.
6. Navigate to GitHub, log in, and create a new repository by clicking the `+` icon in the top right-hand corner.
7. Add a `README.md` file with `echo "# pycon-za-test" >> README.md`.
8. Let's start tracking our files with `git status && git add .`.
9. Let's commit our changes with `git commit -m "Our first commit"`.
10. Now let's push our changes to our remote repository.
    - First we need to add it with `git remote add origin https://github.com/ramonpzg/pycon-za-test.git` or, with ssh, `git remote add origin git@github.com:ramonpzg/pycon-za-test.git` (replace the URL with that of your own repository).
    - Then we can push our changes with `git push -u origin master`. If your default branch is called `main`, use that name instead.
    - Note: You might get an authentication error if this is the first time you are pushing changes to GitHub. To get around it, click on your profile icon in the top right-hand corner and then go to
        - `Settings` > `Developer settings` > `Personal access tokens` > `Generate new token`.
        - Then, for the purpose of this tutorial only, select all of the scopes and generate the token.
        - Copy it somewhere safe as you might need it more than once.
        - Try pushing your changes again and paste the token as your password when prompted.
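
If you would rather script steps 5 to 10 than type them, the sketch below collects the same git commands in a small Python helper. The function names `first_push_commands` and `run_all` are made up for this example, and the helper assumes `git` is installed and that the remote repository already exists.

```python
import subprocess
from pathlib import Path


def first_push_commands(remote_url: str, branch: str = "master") -> list[list[str]]:
    """Return the git commands from the steps above as argument lists."""
    return [
        ["git", "init"],
        ["git", "add", "."],
        ["git", "commit", "-m", "Our first commit"],
        ["git", "remote", "add", "origin", remote_url],
        ["git", "push", "-u", "origin", branch],
    ]


def run_all(commands: list[list[str]], cwd: Path) -> None:
    """Run each command in order inside `cwd`, stopping at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError as soon as a command fails.
        subprocess.run(cmd, cwd=cwd, check=True)
```

A call such as `run_all(first_push_commands("git@github.com:ramonpzg/pycon-za-test.git"), Path.home() / "our_pyconza_proj")` would then run the whole sequence; swap in the URL of your own repository.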

Great! Now that we are set up to track our code, let's start tracking our data.

### 4.2 Data Version Control

Let's get started with dvc.

1. Let's first create a data folder containing a "fake" remote storage with `mkdir -p data/remote_storage`.
2. Run the `get_data.py` script with `python src/get_data.py`.
3. Initialise our repository with `dvc init`.
4. Add the "fake remote" storage with `dvc remote add -d our_storage data/remote_storage`.
5. Let's add our first file with `dvc add data/02_part/raw/SeoulBikeData.csv`.
6. We can now start tracking the metadata about our data, which lets us version control it and manipulate where it goes, with `git add data/02_part/raw/.gitignore data/02_part/raw/SeoulBikeData.csv.dvc`.
7. Commit changes and push with `git commit -m "the first dataset" && git push`.
8. Let's track our data with dvc by using
    - `dvc status` # to check what's up
    - `dvc commit` # to commit all of the data in our repo
    - `dvc push` # to move the data to our not so remote repo for this demo
9. Now let's start creating our pipeline.
    1. Let's remove our data again with `rm -rf data/02_part` to see the full example.
    2. `dvc run -n get_data -d src/get_data.py -o data/02_part/raw/SeoulBikeData.csv python src/get_data.py` (`dvc run` was removed in DVC 3; on newer versions, `dvc stage add` accepts the same options and `dvc repro` then runs the stage.)
    3. Two files got created: `dvc.yaml` and `dvc.lock`. The former keeps track of the pipeline we started creating with `dvc run`, and the latter contains less human-friendly content (hashes and other metadata).
    4. We can start tracking them with `git add data/02_part/raw/.gitignore dvc.lock dvc.yaml`.
    5. Commit incrementally with `git commit -m "initial steps of our pipeline"` and push with `git push`.
    6. Let's keep adding to our pipeline `dvc run -n prepare -d src/prepare_data.py -d data/02_part/raw/SeoulBikeData.csv -o data/02_part/interim/clean_data.parquet python src/prepare_data.py`
    7. Now we can use `dvc config core.autostage true` so that dvc stages the files it creates or changes with git automatically, and we can stop adding them after each step.
    8. Another one `dvc run -n split_data -d src/split_data.py -d data/02_part/interim/clean_data.parquet -o data/02_part/processed/train.parquet -o data/02_part/processed/test.parquet python src/split_data.py`
    9. Another one `dvc run -n train -d src/train_model.py -d data/02_part/processed/train.parquet -o models/rf_model.pkl python src/train_model.py`
    10. Last one `dvc run -n evaluate -d src/evaluate_model.py -d models/rf_model.pkl -d data/02_part/processed/test.parquet -M reports/metrics.json python src/evaluate_model.py`
    11. We can inspect what our DAG of steps looks like with `dvc dag`.
    12. Now we can run it with `dvc repro`. If we make any changes to a step in the pipeline, the steps that do not depend on that change will not get rerun, as their outputs are cached on disk (in `.dvc/cache`).
    13. Lastly, let's keep track of everything with `git add . && git commit -m "pipeline ready" && git push`.
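
To make the skipping behaviour of `dvc repro` concrete, here is a toy sketch of the idea, not DVC's actual implementation: hash every dependency of a stage, compare the hashes with the ones recorded after the last run, and rerun only when they differ. The JSON lock file and the function names are made up for this illustration; the real `dvc.lock` is a YAML file with more fields.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a file's contents; any change to the file changes the hash."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def stage_is_stale(deps: list[Path], lock_file: Path, stage: str) -> bool:
    """Return True when any dependency hash differs from the recorded one."""
    recorded = {}
    if lock_file.exists():
        recorded = json.loads(lock_file.read_text()).get(stage, {})
    current = {str(p): file_md5(p) for p in deps}
    return current != recorded


def record_stage(deps: list[Path], lock_file: Path, stage: str) -> None:
    """Store the current dependency hashes after a successful run."""
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    lock[stage] = {str(p): file_md5(p) for p in deps}
    lock_file.write_text(json.dumps(lock, indent=2))
```

In this sketch a stage is rerun only while `stage_is_stale` returns `True`; calling `record_stage` after a successful run is what lets the next invocation skip it.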

## 5. CI/CD Pipelines

### 5.1 GitHub Actions

![gh_actions](https://svrooij.io/assets/images/github-actions-banner.png)

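GitHub Actions is the CI/CD service built into GitHub. It runs workflows, which are defined as YAML files inside the `.github/workflows` folder of a repository, in response to events such as a push.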

Create a GitHub Actions workflow that lints with `flake8`, formats with `black`, and checks types with `mypy`.

In [3]:
!mkdir -p ~/our_pyconza_proj/.github/workflows

In [7]:
%%writefile ~/our_pyconza_proj/requirements.txt

pandas
scikit-learn
numpy
flake8
black
mypy

Overwriting /home/ramonperez/our_pyconza_proj/requirements.txt


In [8]:
%%writefile ~/our_pyconza_proj/.github/workflows/demo_pipe.yml

name: PyCon ZA Demo

on: [push]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10"]

    steps:
      - uses: actions/checkout@v3
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Lint with flake8
        run: |
          # stop the build if there are Python syntax errors or undefined names
          # flake8 ./src/get_data.py --count --select=E9,F63,F7,F82 --show-source --statistics
          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
          flake8 ./src/get_data.py --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
      - name: Reformat with black
        run: |
          black ./src/split_data.py
      - name: Check types with mypy
        run: |
          mypy ./src/prepare_data.py

Overwriting /home/ramonperez/our_pyconza_proj/.github/workflows/demo_pipe.yml


Time to test our pipeline
1. Run `git status`
2. Add files `git add .`
3. Commit and push `git commit -m "added CI and requirements.txt" && git push`
4. Immediately go to the **Actions** tab in your repo and then click on **PyCon ZA Demo** > **your commit message** > **build (3.10)**, and you will be able to inspect what happened inside our workflow.

Congrats! You just ran your first pipeline 😎
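
The `matrix` entry in the workflow is the reason the **Actions** tab shows more than one `build` job: GitHub starts one job per combination of matrix values. The sketch below mimics that fan-out in plain Python. The function `expand_matrix` is made up for this illustration, and the name format only imitates what the GitHub interface displays.

```python
from itertools import product


def expand_matrix(job_id: str, matrix: dict[str, list[str]]) -> list[str]:
    """Return one display name per combination of matrix values."""
    keys = list(matrix)
    names = []
    for combo in product(*(matrix[k] for k in keys)):
        names.append(f"{job_id} ({', '.join(combo)})")
    return names


jobs = expand_matrix("build", {"python-version": ["3.9", "3.10"]})
print(jobs)  # ['build (3.9)', 'build (3.10)']
```

Adding a second key to the matrix, for example an operating system, would multiply the number of jobs rather than add to it.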

### 5.2 Continuous Machine Learning

Now it is time to create a workflow for our machine learning models.

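Steps
1. Create a new branch with `git checkout -b new_params_v1`.
2. Change the random forest parameters (for example, in `src/train_model.py`).
3. Check what changed with `dvc status`.
4. Rerun the affected stages with `dvc repro`.
5. Run `dvc status` again to confirm the pipeline is up to date.
6. Upload the new outputs with `dvc push`.
7. Look at the new metrics with `dvc metrics show`.
8. Compare them against `master` with `dvc metrics diff --show-md master`.
9. Stage everything with `git add .`.
10. Commit with `git commit -m "Testing New Params"`.
11. Push the new branch with `git push --set-upstream origin new_params_v1`.
12. For any later commits on this branch, a plain `git push` is enough.
13. Make sure the workflows folder exists with `mkdir -p .github/workflows`.
14. Add the following workflow to that folder.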

```yaml

name: bikes-pipeline-test # (1)
on: push # (2)
jobs: # (3)
  run: # (4)
    runs-on: [ubuntu-latest] # (5)
    container: docker://dvcorg/cml:0-dvc2-base1 # (6)
    steps: # (7)
      - uses: actions/checkout@v2 # (8)
      - name: cml_run # (9)
        env: # (10)
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }} # (11)
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} # (12)
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} # (13)
        run: | # (14)
          pip install -r requirements.txt # (15)
          
          dvc repro # (16)
          dvc push # (17)
          git fetch --prune # (18)

          echo "# CML Report" > report.md # (19)
          dvc metrics diff --show-md master >> report.md # (20)
          cml-send-comment report.md # (21)
```

The numbers below match the `# (n)` comments in the workflow.

1. The name of our CI/CD pipeline.
2. Indicates to GitHub that the pipeline should run every time we push changes to our repository.
3. Lists the jobs that should run on push.
4. The identifier of our only job, `run`.
5. Use the latest Ubuntu runner to test our code.
6. The environment will use a container with DVC and CML already installed in it.
7. Follow the next steps after the operating system and the environment have been set up.
8. This action checks out your repository under `$GITHUB_WORKSPACE`, so your workflow can access it.
9. The name of our CML run.
10. What follows are environment secrets needed for the run.
11. Your GitHub access token needs to be accessible by your environment.
12. Your AWS ACCESS KEY ID needs to be accessible within the run in order to push the models to S3.
13. Your AWS SECRET ACCESS KEY needs to be accessible within the run in order to push the models to S3.
14. Run the following commands one after the other. The `|` is very important.
15. Install our dependencies from `requirements.txt`.
16. Reproduce the pipeline using our `dvc.lock` and `dvc.yaml`.
17. Push the data (if it changed) and push the model to our remote storage.
18. Fetch all remote branches and prune the ones that no longer exist, so that `master` is available for the comparison.
19. Create a report markdown file.
20. Compare the metrics of the current branch with those in `master` and append the result to our report as a markdown table.
21. Post the report as a comment on GitHub (on the commit or pull request), which you may also receive as a notification email.

Lastly, create a `requirements.txt` file if you have not done so already.
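
To get a feel for what ends up in `report.md`, the sketch below builds a similar comparison table in plain Python. It only approximates the idea behind `dvc metrics diff --show-md`: the column layout, the function name `metrics_diff_md`, and the metric names and values are all made up for this illustration.

```python
def metrics_diff_md(old: dict[str, float], new: dict[str, float]) -> str:
    """Build a markdown table comparing two flat metrics dictionaries."""
    lines = ["| Metric | master | workspace | Change |", "|---|---|---|---|"]
    for name in sorted(new):
        before = old.get(name)
        after = new[name]
        if before is None:
            # The metric does not exist on master, so there is nothing to diff.
            lines.append(f"| {name} | - | {after:.4f} | - |")
        else:
            lines.append(f"| {name} | {before:.4f} | {after:.4f} | {after - before:+.4f} |")
    return "\n".join(lines)


# Hypothetical metric names and values, for illustration only.
master_metrics = {"mae": 250.0, "r2": 0.85}
branch_metrics = {"mae": 240.5, "r2": 0.87}

report = "# CML Report\n\n" + metrics_diff_md(master_metrics, branch_metrics) + "\n"
print(report)
```

A table like this is what makes the comment useful in review: the reader sees at a glance whether the new parameters improved on `master`.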

In [44]:
%%writefile ~/our_pyconza_proj/.github/workflows/cml.yaml

name: testing-dvc-repro
on: [push]
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - uses: iterative/setup-cml@v1
        with:
          version: '0.18.0'
      - uses: iterative/setup-dvc@v1
      - name: train_model
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip install -r requirements.txt
          dvc dag
          dvc repro
          dvc push
          echo "# My Report" > report.md
          dvc metrics show --show-md >> report.md
          dvc dag >> report.md
          cml send-comment report.md
          cml comment create --publish report.md

Overwriting /home/ramonperez/our_pyconza_proj/.github/workflows/cml.yaml


## 7. Summary