# Workshop

This notebook shows how to work with the platform. The possible actions are:
- Deploy the infrastructure locally with docker-compose.
- Start the main DAG on Airflow to scrape the site and populate the PostgreSQL database.
- Consume the data for analytics purposes.

## Before You Start

It is recommended to create a virtual environment to run this workshop.
- Install Python on your machine
- Install the virtual environment library with **pip install virtualenv**
- Create an environment with **python3 -m venv .venv**
- Activate the environment according to your OS
- Run **pip install -r requirements.txt** inside the new environment
- Run **jupyter notebook** in a terminal where the new environment is active
- Open the **Workshop.ipynb** file to learn about the platform.
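
To confirm that the notebook kernel is actually running inside the environment created above, a small check such as the following sketch may help. The helper name `in_virtualenv` is hypothetical and not part of the workshop code; it relies only on the standard library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Inside a virtual environment:", in_virtualenv())
```

If this prints `False`, the kernel was most likely started from the system interpreter rather than from the activated environment.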

## Deploy Infra

To deploy the infrastructure locally, **docker** and **docker-compose** must be correctly configured on your machine.
Then start the deployment by running the cell below:

In [None]:
!docker-compose up -d

Alternatively, follow the example further down and use the **deploy_infra_locally** method to start the services.
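
Before deploying, it can be useful to verify that both tools are reachable from the notebook. The sketch below assumes the executables are named `docker` and `docker-compose` and are expected on the `PATH`; `missing_tools` is a hypothetical helper, not part of the platform.

```python
import shutil


def missing_tools(tools=("docker", "docker-compose")):
    """Return the names of the tools that cannot be found on the PATH."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("docker and docker-compose are available.")
```

This only checks that the executables exist; it does not verify that the Docker daemon is running.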

## Cayena Class

A simple class was developed to help Data Scientists work with the platform, providing a simple way to interact with the main services. All the necessary variables are hard-coded in the `__init__` method.

In [None]:
import os
import json
from datetime import datetime as dt
import requests
import psycopg2
import pandas as pd

class Cayena():
    def __init__(self, user='airflow', password='airflow'):
        self.AUTH        = (user, password)
        self.DB_USER     = "cayena"
        self.DB_DATABASE = "cayena"
        self.DB_PASS     = "cayena"
        self.DB_HOST     = "localhost"
        self.DB_PORT     = 5432
        self.BASE_URL    = "http://localhost:8080/api/v1"
        self.DAG_ID      = "web_scraping_pipeline"
        
    def _connect(self):
        # Open a PostgreSQL connection using the settings defined in __init__.
        self.connection = psycopg2.connect(
            database=self.DB_DATABASE,
            user=self.DB_USER,
            password=self.DB_PASS,
            host=self.DB_HOST,
            port=self.DB_PORT)
        self.cursor = self.connection.cursor()

        
    def start_web_scraping_dag(self):
        # Trigger a new run of the DAG through the Airflow REST API.
        # json={} sends an empty JSON body with the matching content-type header.
        response = requests.post(
            self.BASE_URL + f'/dags/{self.DAG_ID}/dagRuns',
            json={}, auth=self.AUTH)
        return response
    
    def check_dag_status(self):
        return json.loads(
            requests.request(
                "GET",
                self.BASE_URL + f'/dags/{self.DAG_ID}/dagRuns',
                auth=self.AUTH).text
        )
    
    def deploy_infra_locally(self):
        os.system('docker-compose up -d')
        
    def get_books_table_as_df(self):
        sql = "SELECT * FROM cayena.analytics.books"
        return self.get_query_result_as_df(sql)
    
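    def get_query_result_as_df(self, sql):
        # Open the database connection on first use.
        if getattr(self, "connection", None) is None:
            self._connect()
        return pd.read_sql_query(sql, con=self.connection)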
    
    def stop_workshop(self):
        os.system('docker-compose down --remove-orphans')
    

### Instantiate Class

Creating an object from our class can be interpreted as the user requesting access to manage the platform, so an authentication method could be required at this point to grant access.

In [None]:
cayena = Cayena()

### Deploy Infra

As mentioned before, the infrastructure must be deployed locally in order to run this workshop. If you have not already set up your environment with the **docker-compose** command, you can call the **deploy_infra_locally** method to start all the services locally.

In [None]:
cayena.deploy_infra_locally()

### Execute DAG

To populate the **books** table with the information provided by the fake website, we can use the **start_web_scraping_dag** method to trigger our main DAG through the Airflow API.

In [None]:
cayena.start_web_scraping_dag()

### Check DAG execution

Using the **check_dag_status** method, it is possible to retrieve all the attempts to execute the DAG and check the current state of the pipeline.
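
A minimal sketch of how the returned dictionary could be inspected. It assumes the response follows the shape of Airflow's stable REST API, that is, a `dag_runs` list whose entries carry `dag_run_id` and `state`. The helper `summarize_dag_runs` is hypothetical and not part of the Cayena class, and the sample payload is made up for illustration.

```python
def summarize_dag_runs(status):
    """Return (dag_run_id, state) pairs from a dagRuns response dictionary."""
    # Assumption: the payload contains a "dag_runs" list; a missing key
    # is treated as "no runs yet".
    return [
        (run.get("dag_run_id"), run.get("state"))
        for run in status.get("dag_runs", [])
    ]


# Made-up payload standing in for the result of cayena.check_dag_status().
sample = {
    "dag_runs": [
        {"dag_run_id": "manual__2024-01-01T00:00:00+00:00", "state": "success"},
        {"dag_run_id": "manual__2024-01-02T00:00:00+00:00", "state": "running"},
    ],
    "total_entries": 2,
}

for run_id, state in summarize_dag_runs(sample):
    print(run_id, state)
```

With the platform running, the same helper could be applied to the live result: `summarize_dag_runs(cayena.check_dag_status())`.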

### Query to DB

To fetch all the data from the **books** table, use the **get_books_table_as_df** method. To run a custom query, the **get_query_result_as_df** method is what you are looking for.

In [None]:
cayena.get_books_table_as_df()

In [None]:
cayena.get_query_result_as_df("SELECT * FROM cayena.analytics.books LIMIT 10")

### Clean Up

After finishing all the work, you can shut down the infrastructure by running the cell below or by calling the **stop_workshop** method.

In [None]:
!docker-compose down --remove-orphans

In [None]:
cayena.stop_workshop()
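
The **stop_workshop** method uses `os.system` and discards its exit status, so a failed shutdown goes unnoticed. As an alternative sketch, not the class's own implementation, the same command could be run through `subprocess.run` to report whether it succeeded. The helper `compose_down` and its `runner` parameter are hypothetical; the parameter exists only so the function can be exercised without Docker.

```python
import subprocess


def compose_down(runner=subprocess.run):
    """Run `docker-compose down --remove-orphans`; return True on success."""
    command = ["docker-compose", "down", "--remove-orphans"]
    try:
        result = runner(command)
    except FileNotFoundError:
        # Raised by subprocess.run when docker-compose is not installed.
        return False
    # An exit code of 0 means the command completed successfully.
    return result.returncode == 0


# Usage with the real command (uncomment to run):
# print("Shutdown succeeded:", compose_down())
```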