# Stack Overflow Data Pipeline

In this project, I will create a data pipeline in the Cloud using Apache Airflow.

The dataset I will use for this project is an archive of Stack Overflow content.

## Architecture

![flow_of_data](../screenshots/pipeline_data_flow.png)

This diagram illustrates the flow of data from the source database (AWS RDS - Source Database), through to the data processing step (AWS EC2 - Apache Airflow), and finally, insertion into the analytical database (AWS RDS - Analytical Database).

## EC2 Setup

First, I will set up an EC2 instance on AWS.

EC2 is a web service that provides computing capacity in the Cloud.

I will configure Airflow to run inside my EC2 instance, so that the pipeline runs in the Cloud, independent from my local machine.

To begin, I log in to AWS and create the instance on EC2:

![ec2_instance](../screenshots/ec2_instance.png)


Next, I connect to the EC2 instance via the terminal using a secure shell (SSH) connection.

![ec2_connect](../screenshots/ec2_ssh_connection.png)

To connect, I need the `.pem` key file that was generated when the instance was created.

I save this into my current directory and then run the following command:

```shell
ssh -i "batching-project.pem" ec2-user@ec2-18-130-230-199.eu-west-2.compute.amazonaws.com
```

![ec2_terminal](../screenshots/ec2_terminal.png)

## Apache Airflow

With a connection to the EC2 instance established, I can now install Apache Airflow inside it.

I first create a virtual environment (venv) on the EC2 instance and then install my project dependencies inside it using `pip`.

With all my dependencies installed, I can initialise the Airflow metadata database and then start the scheduler and the webserver on the EC2 instance, using the following commands:

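```shell
# Initialise the Airflow metadata database (only needed once)
airflow db init

# Start the scheduler (keeps running in the foreground)
airflow scheduler

# Start the webserver on port 8080 (run in a separate terminal pane)
airflow webserver -p 8080
```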

I use `tmux` to run multiple panes in my terminal.

This way, I can easily switch between the webserver, the scheduler, the EC2 terminal and my local terminal.

![airflow_ec2](../screenshots/airflow_running_ec2.png)

Now that the webserver is running, I can visit it using the public EC2 IP address and the Airflow port:

http://18.130.230.199:8080/
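Besides opening the page in a browser, the webserver can be checked from a script. The sketch below is a minimal illustration in Python, assuming an Airflow 2 webserver, which exposes a `/health` endpoint reporting the status of the metadata database and the scheduler. The function names are my own, and `check_airflow` needs network access to the instance, so the example at the bottom uses a hand-written payload instead.

```python
import json
from urllib.request import urlopen


def health_url(host, port=8080):
    """Build the URL of the Airflow webserver's health endpoint."""
    return f"http://{host}:{port}/health"


def is_healthy(payload):
    """Return True if both the metadatabase and the scheduler report 'healthy'."""
    return all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )


def check_airflow(host, port=8080, timeout=5):
    """Fetch the health endpoint and evaluate it (requires network access)."""
    with urlopen(health_url(host, port), timeout=timeout) as response:
        return is_healthy(json.load(response))


# Offline example with a hand-written payload, not a real response.
example = {"metadatabase": {"status": "healthy"}, "scheduler": {"status": "unhealthy"}}
print(health_url("18.130.230.199"), is_healthy(example))
```

A result of `False` from `is_healthy` usually means the scheduler process is not running, which is easy to miss when it lives in a separate `tmux` pane.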

To log in, I must first create a user in the EC2 terminal.

```shell
airflow users create \
    --username admin \
    --firstname Peter \
    --lastname Parker \
    --role Admin \
    --password example \
    --email spiderman@superhero.org
```

After signing in with these credentials, I am logged into Airflow running on the EC2 instance.

![airflow](../screenshots/airflow.png)

## Connecting to RDS

With Airflow now set up, I can connect to the two databases held on the AWS Relational Database Service (RDS).

One database contains the raw Stack Overflow data (source), while the other is the analytical database (target), which is currently empty and will receive the transformed data.

To do this, I create two new connections in Airflow, using the relevant RDS credentials.

![rds_connection](../screenshots/rds_connection.png)
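As an alternative to the UI form, Airflow also reads connections from environment variables named `AIRFLOW_CONN_<CONN_ID>`, where the value is a connection URI. The sketch below shows how such a URI might be built in Python. The connection id `source_db`, the `postgres` scheme, and every host, user, password and database value are placeholders I have assumed for illustration; they are not the project's real settings.

```python
from urllib.parse import quote


def build_conn_uri(user, password, host, port, database, scheme="postgres"):
    """Build a connection URI, percent-encoding the credentials."""
    # Characters such as '@' or '/' in credentials would otherwise break the URI.
    return (
        f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


def conn_env_var(conn_id):
    """Return the environment variable name Airflow checks for a connection id."""
    return f"AIRFLOW_CONN_{conn_id.upper()}"


# Hypothetical placeholder values; replace with the real RDS credentials.
source_uri = build_conn_uri(
    "admin", "p@ss/word", "source.example.com", 5432, "stackoverflow"
)
print(f"export {conn_env_var('source_db')}='{source_uri}'")
```

Defining connections this way keeps credentials out of the Airflow metadata database, although connections created from environment variables do not appear in the UI list.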

## DAGs

With everything connected, I can now create my first DAG.

Its requirements are as follows:

- The company wants to ensure that all posts loaded into the target database have a `body` field that is **not empty**. This is to ensure data quality and consistency in the target database.
- Additionally, the company wants to avoid reprocessing the same data every time the ETL process runs.
- They are interested in these fields: `id`, `title`, `body`, `owner_user_id`, and `creation_date`.
- The DAG should run every 15 minutes.
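As a rough sketch of how these requirements translate into logic, the Python below keeps only the five requested fields, drops posts with an empty body, and uses a high-water mark on `id` to skip rows that were already loaded. The function name, the in-memory list of dictionaries, and the choice of `id` as the watermark are my own assumptions for illustration; the actual DAG reads from and writes to the RDS databases.

```python
FIELDS = ("id", "title", "body", "owner_user_id", "creation_date")
SCHEDULE = "*/15 * * * *"  # cron expression for "every 15 minutes"


def select_new_posts(rows, last_loaded_id):
    """Return rows newer than last_loaded_id with a non-empty body, trimmed to FIELDS."""
    selected = []
    for row in rows:
        if row["id"] <= last_loaded_id:
            continue  # already processed in an earlier run
        body = row.get("body")
        if body is None or not body.strip():
            continue  # data-quality rule: body must not be empty
        selected.append({field: row.get(field) for field in FIELDS})
    return selected


# Tiny made-up sample, not real Stack Overflow data.
sample = [
    {"id": 1, "title": "Old", "body": "seen before", "owner_user_id": 7,
     "creation_date": "2023-01-01", "score": 3},
    {"id": 2, "title": "Empty", "body": "  ", "owner_user_id": 8,
     "creation_date": "2023-01-02", "score": 0},
    {"id": 3, "title": "New", "body": "How do I ...?", "owner_user_id": 9,
     "creation_date": "2023-01-03", "score": 5},
]
new_posts = select_new_posts(sample, last_loaded_id=1)

# The watermark for the next run is the largest id loaded so far.
next_watermark = max((post["id"] for post in new_posts), default=1)
```

Storing the watermark between runs (for example, by reading the maximum `id` already present in the target table) is what prevents the same rows from being reprocessed every 15 minutes.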

To create new files on the EC2 instance, I connect VS Code to it using a remote SSH extension.

From here, I can browse the instance's file system and create or edit files directly in the editor.

![vscode](../screenshots/vscode_ssh.png)

## DAG Breakdown