Repository that contains all self taught data engineering concepts and hands on project.


Source: https://www.youtube.com/watch?v=qWru-b6m030


Data Engineering Primer

How does data engineering work?

The Excel method:

  • Multiple data sources >> manually collect and organize the data in the required format >> a developer uses Excel to extract meaningful insights and build visualizations by hand >> the end user sees his/her report.



ETL pipeline.

To automate the above process, a data engineer builds a pipeline.

Extract:

  • set up API connections to all the data sources.

Transform:

  • remove errors from the data,
  • convert it to the required format,
  • map the same types of records to each other,
  • validate the data.

Load:

  • Load the processed data into a database (MySQL, etc.).
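
A minimal sketch of these three steps in Python. The API endpoint, table, and column names are all illustrative, and SQLite stands in for MySQL so the example is self-contained:

```python
import sqlite3

import requests

def extract():
    """Extract: pull raw records from the (hypothetical) source API."""
    resp = requests.get("https://api.example.com/orders")  # illustrative endpoint
    resp.raise_for_status()
    return resp.json()  # e.g. [{"id": "1", "amount": "19.99", "country": "us"}, ...]

def transform(records):
    """Transform: remove errors, convert formats, validate."""
    cleaned = []
    for r in records:
        if r.get("id") is None or r.get("amount") is None:
            continue  # data validation: drop broken records
        cleaned.append({
            "id": int(r["id"]),            # convert to the required format
            "amount": float(r["amount"]),
            "country": str(r.get("country", "")).upper(),  # normalize so records map to each other
        })
    return cleaned

def load(records, db_path="analytics.db"):
    """Load: write the processed data into a database (SQLite standing in for MySQL)."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO orders VALUES (:id, :amount, :country)", records
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract()))
```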

BI tools:

  • Power BI or Tableau can be used to present the visualizations to the end analytics user.

The data engineer will build a script to run this entire operation batch-wise, so the above task is automated and the end user gets the report without any human intervention.
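
A sketch of that batch automation using only the standard library; in practice this would usually be a cron entry or an orchestrator such as Airflow:

```python
import time

BATCH_INTERVAL_SECONDS = 24 * 60 * 60  # one run per day

def run_pipeline():
    # extract() -> transform() -> load() from the sketch above
    print("batch finished, report tables refreshed")

# rough cron equivalent: 0 0 * * *  python etl.py
while True:
    run_pipeline()
    time.sleep(BATCH_INTERVAL_SECONDS)  # sleep until the next batch window
```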

Problems with the above ETL pipeline setup:

  • The above pipeline fails when the data size starts to increase.
  • Standard transactional DBs (MySQL, etc.) are optimized to perform write operations; they are very resilient and great for running apps,
    • however, they aren't optimized for analytical jobs
    • and they suffer when processing complex queries.

Data Warehouse.

Instead of a standard DB, the new place to keep the consolidated data from multiple sources is a Data Warehouse.

  • A Data Warehouse is a repository that consolidates data from all sources in a single central place.
  • The idea of a warehouse is to structure/organize the incoming data into tables, and the tables into schemas (relationships between tables).

Main difference between warehouse and database:

A warehouse is specifically optimized to run complex analytical queries, as opposed to the simple transactional queries of a regular database.

The pipeline with a warehouse:

  • data is generated at the sources,
  • automatically pulled by ETL scripts, transformed and validated on the way,
  • and finally populates the tables inside the warehouse.
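
To make the difference concrete, here is a sketch of the two query styles against the illustrative `orders` table from the ETL example: a transactional DB is tuned for the first, a warehouse for the second:

```python
import sqlite3

con = sqlite3.connect("analytics.db")

# Transactional-style query: fetch one row by key (what MySQL-like DBs excel at)
row = con.execute("SELECT amount FROM orders WHERE id = ?", (42,)).fetchone()

# Analytical-style query: scan and aggregate the whole table
# (the workload a warehouse is built for)
for country, total, n in con.execute(
    "SELECT country, SUM(amount), COUNT(*) FROM orders "
    "GROUP BY country ORDER BY SUM(amount) DESC"
):
    print(country, total, n)

con.close()
```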

Data Lakes.

What happens to this pipeline/process when a Data Scientist is involved? How do Data Scientists and Data Engineers collaborate?

Data Scientist:

  • their job is to find hidden insights in data and build predictive models to forecast the future.
  • a data warehouse may not be enough for these tasks,
    • because it is structured around reporting on metrics that are defined in advance.

To fix this issue, Data Engineers could build custom ETL pipelines to assist the Data Scientists.

or, we can use the ...

Data Lake:

  • a data lake is the complete opposite of a data warehouse.
  • it keeps all the data raw, without preprocessing it or imposing a defined schema.
  • it follows the ELT (Extract, Load, Transform) model,
    • because it's the data scientist who gets to decide how to transform the data as per their model requirements.

Hence, the job of the Data Engineer is to enable a constant supply of information into the data lake.
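
A minimal ELT sketch (the `datalake/raw/orders` layout and the cleanup rule are illustrative): the engineer lands the data untouched, and the transform happens later, on the data scientist's terms:

```python
import json
import pathlib
from datetime import datetime, timezone

LAKE_ROOT = pathlib.Path("datalake/raw/orders")  # illustrative lake layout

def land_raw(records):
    """Extract + Load: store the data exactly as it arrived, no schema imposed."""
    LAKE_ROOT.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S")
    path = LAKE_ROOT / f"batch_{stamp}.json"
    path.write_text(json.dumps(records))
    return path

def transform_for_model(path):
    """Transform: done later, by whoever needs the data, as their model requires."""
    raw = json.loads(path.read_text())
    return [r for r in raw if r.get("amount") is not None]  # model-specific rule

batch = land_raw([{"id": 1, "amount": "19.99"}, {"id": 2}])  # record 2 is incomplete
print(transform_for_model(batch))  # only the usable record comes back
```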


What is Big Data.

4V's of big data:

  • Volume(big),
  • Variety(both structured and unstructured),
  • Veracity(requires quality control),
  • Velocity(real-time)

Cue to...


Data Streaming.

Up to this moment, we've only discussed batch data, which means there is some schedule involved from data extraction to transformation.

However, there are applications and use cases where new data is generated every second and we need to stream it to the analytical systems right away.

Publish/Subscribe communication:

  • Synchronous communication, where the data provider and the system that wants the data communicate over an API.

    • this type is based on the request-response model.
    • it gets incredibly slow when a huge volume of data is involved.
  • Asynchronous communication: the pub/sub model enables an asynchronous conversation between multiple systems that generate a lot of data simultaneously.

    • it decouples data sources from data consumers.
    • the data is divided into different topics.
    • data consumers subscribe to these topics; when a new data record or event is generated, it's published inside the topic,
    • allowing the subscribers to consume this data at their own pace.
    • this way, no system has to wait on another to send messages.
  • Kafka is the most popular pub/sub tech.
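
Running Kafka requires a broker, so here is a toy in-process version of the same idea (topics, decoupled producers and consumers, each subscriber reading at its own pace), built on standard-library queues:

```python
import queue
from collections import defaultdict

class ToyBroker:
    """One queue per (topic, subscriber): publishing never blocks, and each
    subscriber drains its own queue at its own pace."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> list of subscriber queues

    def subscribe(self, topic):
        q = queue.Queue()
        self.topics[topic].append(q)
        return q

    def publish(self, topic, event):
        for q in self.topics[topic]:  # fan the event out; no waiting on consumers
            q.put(event)

broker = ToyBroker()
fast_consumer = broker.subscribe("clicks")
slow_consumer = broker.subscribe("clicks")  # independent of the fast one

broker.publish("clicks", {"user": 1, "page": "/home"})
broker.publish("clicks", {"user": 2, "page": "/cart"})

print(fast_consumer.get())  # consumes both events immediately...
print(fast_consumer.get())
print(slow_consumer.get())  # ...while this one lags behind without blocking anyone
```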


Distributed computing.

  • petabytes of data are stored across several servers (sometimes thousands), together called a Server Cluster.
  • Hadoop is a technology used for distributed storage.
    • a framework that allows storing data in clusters.
    • highly scalable.
    • offers a lot of redundancy for securing the info.
  • Spark is a popular data processing framework.
    • it helps process the data stored across Hadoop clusters.
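
A minimal sketch with PySpark (assumes the `pyspark` package is installed; the input path is the illustrative lake directory from earlier). Locally it runs on all cores, and on a real cluster the same code fans out across the nodes:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# "local[*]" uses every local core; on a cluster the master would be YARN/k8s
spark = SparkSession.builder.appName("primer").master("local[*]").getOrCreate()

# Read from distributed storage (HDFS, S3, ...) -- the path here is illustrative
df = spark.read.json("datalake/raw/orders")
df = df.withColumn("amount", F.col("amount").cast("double"))  # raw strings -> numbers

# The aggregation is split into tasks that the cluster executes in parallel
totals = df.groupBy("country").agg(F.sum("amount").alias("total"))
totals.show()

spark.stop()
```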

Finally, the full big data engineering pipeline chains all of the above pieces together.

