Merge pull request #85 from vbalalian/airbyte-minio-connection
Airbyte minio connection
vbalalian authored Jan 13, 2024
2 parents 3d9861d + ac6e407 commit 7f403cb
Showing 11 changed files with 565 additions and 101 deletions.
19 changes: 19 additions & 0 deletions .env
@@ -0,0 +1,19 @@

HOST=http://host.docker.internal

# Database name and credentials
POSTGRES_USER=postgres
POSTGRES_PASSWORD=nonprodpasswd
POSTGRES_DB=roman_coins

# Airbyte credentials
AIRBYTE_USERNAME=airbyte
AIRBYTE_PASSWORD=nonprodpasswd
CONNECTOR_START_DATE=2024-01-01

# Minio credentials
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=nonprodpasswd
MINIO_BUCKET_NAME=roman-coins
MINIO_NEW_USER=roman-coins-user
MINIO_NEW_USER_PASSWORD=nonprodpasswd
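
The `.env` file above is a flat list of `KEY=VALUE` pairs with `#` comments. As a minimal sketch of that format, the snippet below parses such text into a dictionary using only the standard library. The `parse_env` helper and the `SAMPLE` text are illustrative assumptions; how the project itself consumes the file is not shown in this diff.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments.

    Hypothetical helper for illustration; not part of the repository.
    """
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=VALUE line
        values[key.strip()] = value.strip()
    return values


# A few lines copied from the .env file in this commit.
SAMPLE = """\
HOST=http://host.docker.internal

# Minio credentials
MINIO_ROOT_USER=minioadmin
MINIO_BUCKET_NAME=roman-coins
"""

config = parse_env(SAMPLE)
print(config["MINIO_BUCKET_NAME"])
```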
50 changes: 20 additions & 30 deletions README.md
@@ -6,7 +6,7 @@

Extracting, Loading, and Transforming data on Roman Coins gathered from wildwinds.com

**Tools:** Python, PostgreSQL, Docker, FastAPI, Airbyte
**Tools:** Python, PostgreSQL, Docker, FastAPI, Airbyte, MinIO

### [Web Scraper](web_scraping/web_scraper.py)

@@ -16,9 +16,13 @@ Scrapes data on coins from the Roman Empire from wildwinds.com, and loads the da

Serves data from the roman coins dataset, and allows data addition and manipulation via POST, PUT, and PATCH endpoints. Data is continuously added during web scraping.

### [Custom Airbyte Connector](custom-airbyte-connector/source_roman_coin_api/source.py)
### [Airbyte](airbyte-api-minio-connection/airbyte_connection_config.py)

Streams incremental data from the api.
[Custom Airbyte connector](custom-airbyte-connector/source_roman_coin_api/source.py) streams incremental data from the API to a standalone MinIO bucket.

### [MinIO](https://min.io)

Resilient storage for the incoming data stream. Data is replicated ["at least once"](https://docs.airbyte.com/using-airbyte/core-concepts/sync-modes/incremental-append-deduped#inclusive-cursors) by Airbyte, so some duplicated data is acceptable at this stage. Deduplication is left to dbt at the next stage of the pipeline.
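
To illustrate why "at least once" delivery is tolerable here, the sketch below removes re-sent records in plain Python (not dbt, which the project plans to use). It keeps one record per key, preferring the one with the latest cursor value. The field names `id` and `modified` and the sample rows are hypothetical; the actual record schema is not shown in this diff.

```python
from typing import Iterable


def deduplicate(
    records: Iterable[dict], key: str = "id", cursor: str = "modified"
) -> list[dict]:
    """Keep one record per key, preferring the latest cursor value.

    Field names are assumptions for illustration only.
    """
    latest: dict = {}
    for record in records:
        k = record[key]
        # ISO 8601 timestamps in the same format compare correctly as strings.
        if k not in latest or record[cursor] >= latest[k][cursor]:
            latest[k] = record
    return list(latest.values())


rows = [
    {"id": 1, "modified": "2024-01-01T00:00:00", "name": "Augustus"},
    {"id": 2, "modified": "2024-01-02T00:00:00", "name": "Nero"},
    # Same record delivered again by a later sync.
    {"id": 2, "modified": "2024-01-02T00:00:00", "name": "Nero"},
]

unique = deduplicate(rows)
print(len(unique))
```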

## Requirements:

@@ -28,37 +32,23 @@ Streams incremental data from the api.

## To Run:

**Step 1:**
Run Web Scraper and API:
**(Docker and Airbyte must be running in order to proceed)**

**Step 1 (Optional):** Set preferred credentials/variables in the project's .env file

**Step 2:** Run the following terminal commands

```
git clone https://github.com/vbalalian/roman_coins_data_pipeline.git
cd roman_coins_data_pipeline
docker compose up -d
docker compose up
```
Access version 1 of the API at http://localhost:8010/v1/ \
(Try out the different endpoints using the interactive documentation at http://localhost:8010/v1/docs)
This runs the web scraper, the API, and MinIO, then builds the custom Airbyte connector and configures the API-Airbyte-MinIO connection. Currently, syncs must be triggered manually via the Airbyte UI. The next stage of this project is to handle orchestration via Dagster.

**Step 2:**
Build Custom Airbyte Connector Image:
```
cd custom-airbyte-connector
docker build . -t airbyte/source-roman-coins-api:latest
```
Access the API directly at http://localhost:8010, or interact with the different endpoints at http://localhost:8010/docs

Access the Airbyte UI at http://localhost:8000

**Step 3:**
Add Custom Airbyte Connector to Airbyte instance via Airbyte UI:
- Access Airbyte UI at http://localhost:8000
- Enter Username and Password (default: airbyte/password)
- Enter an email for service notifications
- Navigate to "Settings" -> "Workspace Settings" -> "Sources"
- Click "+ New connector"
- Click "Add a new Docker connector"
- Input fields:
- Connector display name: Roman Coins API
- Docker repository name: airbyte/source-roman-coins-api
- Docker image tag: latest
- Click "Add"
- Input field:
- start_date: (enter a date on or before the current date)
- Click "Set up source"
Access the MinIO Console at http://localhost:9090

View the web_scraper container logs in Docker to follow the progress of the web scraping.
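
Once the stack is up, one quick way to confirm that the services are listening is a TCP check against the ports listed above (8010 for the API, 8000 for the Airbyte UI, 9090 for the MinIO Console). This is a rough sketch using only the standard library; `is_port_open` and `SERVICES` are hypothetical names, and an open port only shows that something is listening, not that the service is healthy.

```python
import socket

# Ports taken from the README above; the dictionary itself is illustrative.
SERVICES = {"API": 8010, "Airbyte UI": 8000, "MinIO Console": 9090}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```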
34 changes: 34 additions & 0 deletions airbyte-api-minio-connection/.dockerignore
@@ -0,0 +1,34 @@
# Include any files or directories that you don't want to be copied to your
# container here (e.g., local build artifacts, temporary files, etc.).
#
# For more help, visit the .dockerignore file reference guide at
# https://docs.docker.com/engine/reference/builder/#dockerignore-file

**/.DS_Store
**/__pycache__
**/.venv
**/.classpath
**/.dockerignore
**/.env
**/.git
**/.gitignore
**/.project
**/.settings
**/.toolstarget
**/.vs
**/.vscode
**/*.*proj.user
**/*.dbmdl
**/*.jfm
**/bin
**/charts
**/docker-compose*
**/compose*
**/Dockerfile*
**/node_modules
**/npm-debug.log
**/obj
**/secrets.dev.yaml
**/values.dev.yaml
LICENSE
README.md
30 changes: 30 additions & 0 deletions airbyte-api-minio-connection/Dockerfile
@@ -0,0 +1,30 @@
# syntax=docker/dockerfile:1

ARG PYTHON_VERSION=3.10.12
FROM python:${PYTHON_VERSION}-slim as base

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /app

ARG UID=10007
RUN adduser \
--disabled-password \
--gecos "" \
--home "/nonexistent" \
--shell "/sbin/nologin" \
--no-create-home \
--uid "${UID}" \
appuser

RUN --mount=type=cache,target=/root/.cache/pip \
--mount=type=bind,source=requirements.txt,target=requirements.txt \
python -m pip install -r requirements.txt

USER appuser

COPY . .

# Run the application.
CMD python3 airbyte_connection_config.py