# Understanding your skills

### Introduction

You know a lot.  You've learned backend engineering, cloud computing, how to build serverless data pipelines, more classic data pipelines with Airflow, and ELT pipelines with DBT.  You can analyze data and perform data storytelling.  And you've already added value to an organization through your externship.

You've earned the skills necessary to qualify not only as a data engineer, but for a range of positions in the data world.

Let's begin by reviewing your skills as a data engineer.

### Your data engineering skills

Below, we'll use the pyramid principle to describe our data engineering skillset.

We see the data engineering skillset as being composed of backend engineering, cloud computing, and data pipelines.

The diagram below shows how these three pieces fit together.

<img src="./data-eng-pipeline.png">

1. Backend engineering

To the left is the Python ETL work: pulling data from an API, scraping HTML, or reading from an OLTP database, and then storing that data either directly in an analytics database or, in this case, in a data lake in S3 (so someone like a data scientist can query it).

2. Cloud computing

We deployed this code to the cloud using both Docker, which packages our code and its related dependencies (e.g. Python version, pip libraries, commands to get running), and AWS as the cloud provider that *hosts* our Docker image.

3. Data pipelines

By data pipelines we mean developing a system that extracts, stores, and transforms our data (ELT in this case).  Here we use S3 as our data lake for data scientists, then move the data to our data warehouse, and repeatedly transform it with DBT.  The end goal is to transform the data until we can deliver it to internal stakeholders who may be less technical than our data scientists, in the form of CSV files (called reports) or dashboards (e.g. Tableau, Power BI).
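To make the extract, load, and transform order concrete, here is a minimal sketch of that flow.  It is not the course codebase: a plain dictionary stands in for the S3 data lake, an in-memory SQLite database stands in for the data warehouse, and a SQL statement stands in for a DBT model.  The table names, the `raw/flights.json` key, and the `fake_fetch` records are all hypothetical.

```python
import json
import sqlite3


def extract(fetch):
    """Extract: call the source and return its records as a list of dicts."""
    return list(fetch())


def load_raw(lake, key, records):
    """Load: store the untouched records as JSON under a key in the lake."""
    lake[key] = json.dumps(records)
    return key


def load_warehouse(conn, lake, key):
    """Copy the raw records from the lake into a warehouse staging table."""
    records = json.loads(lake[key])
    conn.execute("CREATE TABLE IF NOT EXISTS raw_flights (origin TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO raw_flights (origin, price) VALUES (:origin, :price)",
        records,
    )


def transform(conn):
    """Transform inside the warehouse with SQL (the step DBT would manage)."""
    conn.execute("DROP TABLE IF EXISTS avg_price_by_origin")
    conn.execute(
        """
        CREATE TABLE avg_price_by_origin AS
        SELECT origin, AVG(price) AS avg_price
        FROM raw_flights
        GROUP BY origin
        """
    )
    return conn.execute(
        "SELECT origin, avg_price FROM avg_price_by_origin ORDER BY origin"
    ).fetchall()


def fake_fetch():
    # Hypothetical records standing in for an API response.
    return [
        {"origin": "JFK", "price": 200.0},
        {"origin": "JFK", "price": 300.0},
        {"origin": "LAX", "price": 150.0},
    ]


def run_pipeline():
    lake = {}  # stand-in for the S3 data lake
    conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
    key = load_raw(lake, "raw/flights.json", extract(fake_fetch))
    load_warehouse(conn, lake, key)
    report = transform(conn)
    conn.close()
    return report
```

Calling `run_pipeline()` returns one row per origin with its average price, which is the kind of small, aggregated table we would hand to a dashboard or export as a CSV report.  Notice that the raw data is stored before any transformation happens, which is what makes this ELT rather than ETL.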

#### Serverless data pipeline

Remember that we also developed a [serverless data pipeline](https://github.com/data-engineering-jigsaw/airflow-fullstack-etl/tree/solution/codebase).

<img src="./serverless-pipeline.png" width="80%">

The overall structure is really the same, but in this case, we first store our raw data in S3 (for, say, data scientists), then transform our data to be more structured and store that structured data in S3, so that we can ultimately load it into an analytics database.  From there, we can transform it further with DBT.  To keep each of these steps independent, and to be able to invoke them independently, we wrap each step in a separate Docker container that can be invoked through a Lambda function.  We use Airflow as our orchestrator to invoke these Lambda functions in sequence.

And again, we can end by displaying the data in a dashboard like Tableau.
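The idea of independent steps invoked in sequence can be sketched as follows.  This is an illustration under stated assumptions, not the linked codebase: each step uses the `(event, context)` signature of an AWS Lambda Python handler, a dictionary named `FAKE_S3` stands in for S3, and a plain loop named `run_in_sequence` stands in for the Airflow orchestrator.  All of the handler names, keys, and record fields are hypothetical.

```python
import json

# In-memory stand-in for S3; keys mimic object paths (hypothetical names).
FAKE_S3 = {}


def extract_handler(event, context):
    """Step 1: store the raw source records untouched."""
    raw_key = f"raw/{event['run_id']}.json"
    FAKE_S3[raw_key] = json.dumps(event["records"])
    return {"run_id": event["run_id"], "raw_key": raw_key}


def structure_handler(event, context):
    """Step 2: read the raw data, keep well-formed rows, store the result."""
    raw = json.loads(FAKE_S3[event["raw_key"]])
    rows = [
        {"city": r["city"].strip().title(), "price": float(r["price"])}
        for r in raw
        if r.get("city") and r.get("price") is not None
    ]
    structured_key = f"structured/{event['run_id']}.json"
    FAKE_S3[structured_key] = json.dumps(rows)
    return {"run_id": event["run_id"], "structured_key": structured_key}


def load_handler(event, context):
    """Step 3: read the structured data; a real step would insert it into
    the analytics database, here we only count the rows."""
    rows = json.loads(FAKE_S3[event["structured_key"]])
    return {"run_id": event["run_id"], "rows_loaded": len(rows)}


def run_in_sequence(steps, event):
    """Stand-in for the orchestrator: pass each step's output to the next."""
    for step in steps:
        event = step(event, None)
    return event
```

Because each handler only reads its input event and the shared storage, any single step can be rerun on its own, for example calling `structure_handler` again with the same `raw_key` after fixing a cleaning rule.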

#### What about data analysis?

Where does the data analysis come in?  Really in three different places.

<img src="./serverless-pipeline.png" width="80%">

1. Exploring the source layer/potential sources

Before we set up a pipeline to repeatedly pull data from a source like the Amadeus API, we may want to explore that dataset to see if it is worth pulling data from.  There we can use our skills of checking the completeness and representativeness of the data, and whether the features help to explain a target (like revenue).

2. The data lake (raw data)

Remember that the data lake is a place where we can store our raw, unprocessed data.  This is another good place to explore, because we can use data exploration to determine which features are worth transforming and ultimately providing to external stakeholders.

3. Data Dashboards/CSV files

Finally, we can also explore our data at the end of our pipeline, when we are presenting it through dashboards or other visualizations.  We'll have less data to work with at this point, but this is a prime space for our data storytelling skills.
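The completeness check mentioned in the first step can be sketched in a few lines.  This is a minimal illustration using only the standard library.  The function names and the choice to treat `None` and empty strings as missing are assumptions made for the example.

```python
def completeness(records, fields):
    """Return the share of records with a non-missing value for each field.

    A value counts as missing if it is absent, None, or an empty string.
    """
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in fields
    }


def fields_below(report, threshold):
    """List the fields whose completeness falls below the threshold."""
    return sorted(field for field, share in report.items() if share < threshold)
```

For example, running `completeness(sample, ["origin", "price"])` on a small sample pulled from a candidate source, and then `fields_below(report, 0.8)`, gives a quick list of fields that may be too sparse to justify building a pipeline around.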


### Summary

In this lesson, we described our data engineering skills by thinking through our end-to-end data pipeline.

<img src="./data-eng-pipeline.png" width="70%">

As we saw, we can summarize our data engineering skills as (1) backend engineering (Python, SQL, Flask, object-oriented design), (2) cloud computing (AWS, Docker, Bash), and (3) data pipelines (Airflow, DBT).

We built a serverless data pipeline that loads raw source data into S3 (our data lake), transforms that data into structured data, and then loads it into an analytics database.  From there, we transform it further with DBT until it is in OLAP form and ready to present to stakeholders through a dashboard, CSV file, or OLAP data model.