
Airbnb Data 2022

Table of contents

  • Description
  • Tools
  • Pipeline
  • Replication
  • How does it work?
  • Improvements

Description

Airbnb dataset as of 2022. The purpose is to gain some insights and comparisons about Airbnb rentals all over the world. Developed as the final project for the 2023 Data Engineering Zoomcamp. For that reason, Airbnb data is taken on a quarterly basis and includes data for 61 cities, mostly from Europe and the US. Findings are, for example, that cities like Berlin, Vienna or Montreal are cheaper on average compared to Amsterdam or New York, which was to be expected. However, Melbourne and Sydney are the most expensive cities after the region of Hawaii, which was not obvious. Cape Town and New Orleans are also rather expensive. Comparing ratings, the ranking is similar on average.

Basic

Places

Find the dashboard here

Tools

  • python
  • terraform
  • docker
  • gcs
  • big query
  • dbt
  • makefile
  • spark
  • google data studio (aka looker studio)
  • ci/cd using megalinter with github actions

Pipeline

Architecture

A makefile executes a terraform script, which creates a gcs bucket with a random number appended, because every bucket name must be unique. After that, a dataset is created in BigQuery and the raw tables are added; for this, two SQL files stored in infrastructure are executed. Following that, a container creates the dbt environment, including the needed dbt profile. The next step is fetching data. Originally coming from insideairbnb, the data is now also stored within a gcs bucket, since insideairbnb only provides data on a quarterly basis for the last 12 months, so recreating the project at a later point in time would otherwise be impossible. Therefore, using spark is not strictly necessary; copying directly from gcs to gcs would be possible. However, to simulate a more realistic scenario, spark is used to fetch and copy the data into its own gcs bucket, which serves as the data lake. From there, a short python script ingests the data into BigQuery, which is used as the data warehouse.
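
The spark copy step boils down to reading the quarterly files and writing them back out to the project's own bucket. A minimal sketch of that idea, assuming the GCS connector is configured and using made-up bucket names and paths (not the repository's actual script):

```python
# Minimal sketch of the spark copy step; bucket names and paths are hypothetical.
from pyspark.sql import SparkSession

# Assumes the GCS connector and service account credentials are already configured.
spark = SparkSession.builder.appName("airbnb-gcs-copy").getOrCreate()

# Source: a quarterly insideairbnb extract mirrored in a gcs bucket (hypothetical name)
src = "gs://source-airbnb-data/listings-2022-q1.csv.gz"
# Destination: the project's own data lake bucket (hypothetical name)
dst = "gs://my-airbnb-datalake/raw/listings/2022-q1"

# Read the compressed CSV and write it to the data lake
df = spark.read.option("header", True).csv(src)
df.write.mode("overwrite").parquet(dst)
```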

Within BigQuery, two raw tables hold the raw data. Before new raw data is ingested, these tables are cleaned so that they do not contain any data; the complete new data is then inserted into them. Afterwards, an ephemeral dbt model makes some type transformations and joins both tables together. Subsequently, dbt snapshot builds two dimension tables, before another dbt model builds a fact table to complete the star schema data model. Due to its size, partitioning and clustering techniques are used for the fact table. Finally, dbt builds some views which provide aggregated values for later use in Looker Studio.
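
In the repository the partitioning and clustering are configured via dbt; purely as an illustration of what that configuration amounts to on the BigQuery side, here is a sketch using the google-cloud-bigquery client (table id, schema and field names are made up):

```python
# Illustration only: what partitioning and clustering of the fact table mean in BigQuery.
# In this project it is configured through dbt, not through client code like this.
from google.cloud import bigquery

client = bigquery.Client()  # uses the service account credentials

table = bigquery.Table(
    "my-project.airbnb.fact_listings",  # hypothetical table id
    schema=[
        bigquery.SchemaField("listing_id", "INTEGER"),
        bigquery.SchemaField("city", "STRING"),
        bigquery.SchemaField("price", "NUMERIC"),
        bigquery.SchemaField("quarter_date", "DATE"),
    ],
)
# Partition by the quarterly date and cluster by city, so queries that
# compare cities or quarters scan as little data as possible.
table.time_partitioning = bigquery.TimePartitioning(field="quarter_date")
table.clustering_fields = ["city"]

client.create_table(table, exists_ok=True)
```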

For context: since fresh data is only provided every three months, the makefile is used to execute the complete pipeline and it loops over the available datasets. That means the makefile itself is the orchestration file, since it makes sure that every step is executed only after the previous step has finished. Therefore, a "real" orchestration tool is not implemented (prefect is provided as a docker image, but no flow files are added, since it is simply not needed).
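
The orchestration logic of the makefile is essentially a sequential loop over the quarters. A rough Python equivalent of that idea (the repository uses make, not this script; image names and quarter labels are invented):

```python
# Rough Python equivalent of the Makefile's orchestration loop (illustration only).
import subprocess

QUARTERS = ["2022-03", "2022-06", "2022-09", "2022-12"]  # hypothetical quarterly extracts

for quarter in QUARTERS:
    # subprocess.run() blocks until the container exits, and check=True aborts on failure,
    # so each step only starts after the previous one has finished successfully.
    subprocess.run(["docker", "run", "--rm", "ingest-image", quarter], check=True)
    subprocess.run(["docker", "run", "--rm", "dbt-image"], check=True)
```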

dbt lineage:

Replication

This project can be replicated on Linux, macOS and Windows. However, besides gcp credentials, it also requires some settings. Be aware that, depending on your resources and internet connection, recreating this project may take some time, approx. 30-60 minutes.

Requirements

  • Create an account on gcp
  • Make a new project
  • Open IAM API
  • Create and download service account key
  • provide this key and project id (see execution)
  • make (on Debian/Ubuntu it can be installed by apt-get install make)
  • docker > 20.10.17 (it does not work with docker 20.10.17, which is the default version in snap; tested with docker 23.0.3. Have a look at docker for your OS)
  • If you're using a VM, don't make it too small. The process was tested on a gcp VM, e2-standard-4, x86/64 architecture, 30 GB disk space, using Debian 11 bullseye and Ubuntu 22.04 LTS

Execution

  • clone this repository
  • Within this repository two text files are provided: key.txt and gcs_project.txt. Store the path and name of your key file in the first file. You can edit it, for example, by using nano key.txt. Make sure no extra spaces or lines exist. In the second file, set the id of your project. Both files contain an example.
  • To update docker on Debian/Ubuntu execute make docker
  • The first two points are optional. To now build the infrastructure, execute make infrastructure
  • The last step requires the most time, since a lot of data is transferred twice: from gcs to gcs and then to BigQuery. In addition, dbt is executed in every loop iteration to provide slowly changing dimension tables. Execute it with make pipeline
  • As a tip: depending on individual settings, the process may require you to enter the sudo password several times, so increasing the sudo timeout is recommended. By executing EDITOR=nano visudo (without EDITOR=nano, vi is used), the file /etc/sudoers is opened. Add Defaults timestamp_timeout=60 to increase the timeout to 60 minutes. Please use visudo for that, because it locks the sudoers file, saves edits to a temporary file and checks for syntax errors before copying to the original file. This is recommended, since any error makes sudo unusable. For more information, including security risks, have a look here

How does it work?

In more detail, the process works as follows:

  1. Running make infrastructure
    1. a docker container is built, which allows executing terraform scripts
    2. the container from the previous step is executed; it mounts the key and project id and executes the terraform script stored in "infrastructure". After running, the container is deleted.
    3. the containers for spark, for creating raw tables on BigQuery and for data ingestion to BigQuery are built.
    4. the container which creates raw tables is executed, running the two SQL files that create the raw tables on BigQuery
    5. an image for dbt and another for the dbt profile are built. After that, the image for the profile is executed
  2. Running make pipeline
    1. the spark container loads data from one gcs bucket to another using a python script. (If wanted, one can use this image to play around; a jupyter notebook is provided and can be used within this container.)
    2. now a loop runs, for every quarter, a container which ingests the raw data from the data lake into BigQuery. After that, another docker container runs dbt to add the data to the dim and fact tables (a rough sketch of this ingestion step follows below).
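
A minimal sketch of what the per-quarter ingestion step does, assuming parquet files in the data lake and using made-up bucket and table names (not the repository's exact script):

```python
# Minimal sketch of the quarterly ingestion from the data lake into BigQuery.
# Bucket, table and file names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.airbnb.raw_listings"                     # hypothetical raw table
uri = "gs://my-airbnb-datalake/raw/listings/2022-q1/*.parquet"  # hypothetical data lake path

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # clean the raw table first
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait until the load has finished

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```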

Improvements

If one has access to monthly or even daily updated data, one should implement an orchestration tool like airflow (a prefect container is already provided and could also be used). However, for quarterly data, it is easier to simply add the quarter to the loop in the makefile. The main purpose of this project is learning and having fun, meaning some room is left for improvements due to a lack of time. For instance, unit and performance tests could be included. In general, the performance of spark and of the ingestion to BigQuery could be improved. Finally, implementing some checks for whether the needed tools, like docker, are installed and/or up to date would be nice.
