## 20. Data Engineering

#### Resources

[Beginners Guide to Data Engineering Part 1](https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7)<br/>
[Beginners Guide to Data Engineering Part 2](https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-ii-47c4e7cbda71)<br/>
[What is Data Engineering?](https://www.dataquest.io/blog/what-is-a-data-engineer/)<br/>
[Data Engineer DDAT Description](https://www.gov.uk/government/publications/data-engineer-role-description/data-engineer-role-description--2)<br/>
[The Rise of the Data Engineer](https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603)<br/>
[Data Pipelining](https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db)<br/>
[Apache Airflow](http://michal.karzynski.pl/blog/2017/03/19/developing-workflows-with-apache-airflow/)<br/>

#### Data Engineering

**What is Data Engineering?**  

Data science is a team sport. There are many different team roles, including: 

* Data Scientists
* Business Architects
* Data Architects
* Data Visualizers
* Data Engineers

Data scientists are only as good as the data they have access to. Most companies store their data in a variety of formats across databases and text files. This is where data engineers come in: they build pipelines that transform that data into formats that data scientists can use. Data engineers are just as important as data scientists, but they tend to be less visible because they work further from the end product of the analysis and deal with many of the less glamorous underpinning elements, as shown in the diagram below.

![Data Science Hierarchy of Needs](./images/ds-hierarchy-of-needs.png)

This framework puts things into perspective. Before a company can optimize the business more efficiently or build data products more intelligently, layers of foundational work need to be built first.  

The field of Data Engineering could be thought of as a superset of business intelligence and data warehousing that brings in more elements from software engineering. This discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale.

Among the many valuable things that Data Engineers do, one of their highly sought-after skills is the ability to design, build, and maintain data warehouses. Just like a retail warehouse is where consumable goods are packaged and sold, a data warehouse is a place where raw data is transformed and stored in query-able forms.

**Data Engineering Skills**  

A Data Engineer needs to be good at:  

* Architecting distributed systems
* Creating reliable pipelines
* Combining data sources
* Architecting data stores
* Collaborating with data science teams and building the right solutions for them

**What does a Data Engineer do?**

As Data Science exploded, Data Engineering emerged as a complementary discipline, taking cues from its sibling while also defining itself in opposition and finding its own identity. Like Data Science, Data Engineering is a broad field, but an individual data engineer doesn't need to master the whole spectrum of skills. Instead, they can be 'T-shaped', with a single deep specialisation and broad overall knowledge.

**Infrastructure**

In smaller teams or organisations, where no data infrastructure team has yet been formalized, the data engineering role may also cover the workload around setting up and operating the organization’s data infrastructure. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like. This would usually be designed by a data architect and built by a data engineer (Architects design, Engineers build).
<br/><br/>
**Data Modelling**

Data modelling refers to the practice of documenting software and business system design. Again, in larger organisations this would be the job of a data architect, but it sometimes falls to Data Engineers as well. The “modelling” of these various systems and processes often involves the use of diagrams, symbols, and textual references to represent the way data flows through a software application or the Data Architecture within an enterprise. This is commonly expressed through an [Entity Relationship Diagram](https://www.lucidchart.com/pages/ER-diagram-symbols-and-meaning) (ERD) as follows:
![Data Model](./images/data-modelling.gif)

*PK = Primary Key FK = Foreign Key*
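
As a minimal sketch of how the primary and foreign keys in an ERD translate into a physical schema, the example below uses Python's built-in `sqlite3` module. The `customer` and `customer_order` tables and their columns are hypothetical and are not taken from the diagram above.

```python
import sqlite3

# Hypothetical two-table model: each order references exactly one customer.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- PK
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,   -- PK
        customer_id INTEGER NOT NULL
                    REFERENCES customer(customer_id),  -- FK
        total       REAL NOT NULL
    );
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, 25.0)")


def insert_orphan_order(connection):
    """Try to insert an order whose customer does not exist."""
    try:
        connection.execute("INSERT INTO customer_order VALUES (11, 99, 5.0)")
    except sqlite3.IntegrityError:
        return False  # the foreign key constraint rejected the row
    return True


orphan_accepted = insert_orphan_order(conn)
order_count = conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]
```

The foreign key is what turns the line between two boxes in the diagram into a rule the database enforces: an order cannot point at a customer that is not there.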
<br/><br/>
**Pipelining**

[Data Pipelining](https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db) is a set of automated actions that extract data from various sources and deliver it to a destination, such as another database or, directly, an analytics or visualization tool. For example: take these columns from this database, merge them with these columns from this API, subset rows according to a value, substitute NAs with the median, and load the result into this other database. Such a sequence is known as a “job”, and pipelines are made of many jobs, as illustrated in the diagram below:
![Data Pipeline](./images/pipeline.png)
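
The example job described above can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the `users` and `user_spend` tables, the `api_rows` list standing in for an API response, and the `run_job` function name. A production pipeline would normally use a scheduler and a dataframe or SQL engine instead.

```python
import sqlite3
from statistics import median

# In-memory stand-in for "this database".
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")
source.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "UK"), (2, "UK"), (3, "FR"), (4, "UK")],
)

# Records as they might come back from an API; None marks a missing value.
api_rows = [
    {"user_id": 1, "spend": 10.0},
    {"user_id": 2, "spend": None},
    {"user_id": 3, "spend": 99.0},
    {"user_id": 4, "spend": 30.0},
]


def run_job(source_conn, api_records, target_conn, country="UK"):
    # Take these columns from this database.
    users = source_conn.execute(
        "SELECT user_id, country FROM users ORDER BY user_id"
    ).fetchall()
    # Merge them with these columns from the API (join on user_id).
    spend_by_user = {r["user_id"]: r["spend"] for r in api_records}
    merged = [(uid, ctry, spend_by_user.get(uid)) for uid, ctry in users]
    # Subset rows according to a value.
    subset = [row for row in merged if row[1] == country]
    # Substitute missing values with the median of the observed ones.
    fill = median(s for _, _, s in subset if s is not None)
    cleaned = [(uid, ctry, fill if s is None else s) for uid, ctry, s in subset]
    # Load the result into the other database.
    target_conn.execute(
        "CREATE TABLE IF NOT EXISTS user_spend "
        "(user_id INTEGER, country TEXT, spend REAL)"
    )
    target_conn.executemany("INSERT INTO user_spend VALUES (?, ?, ?)", cleaned)
    target_conn.commit()
    return cleaned


target = sqlite3.connect(":memory:")
loaded = run_job(source, api_rows, target)
loaded_count = target.execute("SELECT COUNT(*) FROM user_spend").fetchone()[0]
```

Here the missing spend for user 2 is filled with the median of the two observed UK values, and only the UK rows reach the target table.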
<br/><br/>
**ETL**

[ETL](https://www.quora.com/What-is-ETL) stands for Extract, Transform and Load, the three stages in gathering and preparing data for storage and analysis. **Extract** is the stage at which data is extracted from homogeneous or heterogeneous data sources, **Transform** is where the data is converted into the proper format or structure for querying and analysis, and **Load** is where the data is loaded into the target database. ETL used to be conducted through drag-and-drop interfaces. More recently there has been a shift towards a more programmatic approach, as the prevailing view is that code is the best abstraction there is for software, although this is a contentious argument.
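
To illustrate the programmatic approach, here is a minimal sketch in which each ETL stage is an ordinary Python function. The CSV content, the `accounts` table, and the function names `extract`, `transform`, and `load` are all hypothetical and chosen only to mirror the three stages.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with inconsistent whitespace and casing.
RAW_CSV = """name,signup_date,plan
 alice ,2021-01-05,PRO
bob,2021-02-11,free
"""


def extract(text):
    """Extract: read raw records from a source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: normalise values into the shape the target expects."""
    return [
        (r["name"].strip().title(), r["signup_date"], r["plan"].lower())
        for r in records
    ]


def load(rows, conn):
    """Load: write the transformed rows into the target database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts "
        "(name TEXT, signup_date TEXT, plan TEXT)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)
stored = warehouse.execute(
    "SELECT name, plan FROM accounts ORDER BY name"
).fetchall()
```

Because each stage is plain code, it can be version-controlled, reviewed, and unit-tested like any other software, which is the main argument made for code over drag-and-drop tools.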
<br/><br/>

**Hadoop**

[Hadoop](http://hadoop.apache.org/) is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines (also called **nodes**), each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a **cluster** of nodes, each of which may be prone to failure.

**HDFS**  
[HDFS](https://hortonworks.com/apache/hdfs/) stands for Hadoop Distributed File System. This is the file system Hadoop uses to store large amounts of data (terabytes or petabytes) across a large number of individual nodes.

**MapReduce**

[MapReduce](https://hortonworks.com/apache/mapreduce/) is the original framework for writing applications that process large amounts of structured and unstructured data stored in the Hadoop Distributed File System (HDFS). Apache Hadoop YARN opened Hadoop to other data processing engines that can now run alongside existing MapReduce jobs to process data in many different ways at the same time.
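
The programming model behind MapReduce can be sketched as a toy, single-process word count. This is not the Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the two input splits are assumptions for illustration, and a real job would run the map and reduce steps in parallel across nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Two hypothetical input splits, as if stored on different nodes.
splits = ["big data big pipelines", "data pipelines move data"]

mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Each map call only needs its own split and each reduce call only needs the values for one key, which is what lets a framework distribute the work across a cluster.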

**Spark**

Spark is a general-purpose data processing engine that is suitable for use in a wide range of circumstances. Application developers and data scientists incorporate Spark into their applications to rapidly query, analyze, and transform data at scale. Tasks most frequently associated with Spark include interactive queries across large data sets, processing of streaming data from sensors or financial systems, and machine learning tasks.

Despite common opinion, Spark cannot be compared directly to Hadoop as a whole, but [should instead be compared to Hadoop MapReduce](https://www.xplenty.com/blog/apache-spark-vs-hadoop-mapreduce/), since both are processing engines while HDFS is a storage layer that Spark can read from. MapReduce and Spark were designed for slightly different purposes, with MapReduce being better at operations that require writing outputs to disk and Spark being better at in-memory operations where all the data being processed will fit into memory.

**Graph Databases**

We live in a connected world! There are no isolated pieces of information, but rich, connected domains all around us. A database that natively embraces relationships can store, process, and query connections efficiently. While other databases compute relationships at query time through expensive JOIN operations, a graph database stores connections as first-class citizens.

Accessing nodes and relationships in a native graph database is an efficient, constant-time operation and allows you to quickly traverse millions of connections per second per core.

Independent of the total size of your dataset, graph databases excel at managing highly connected data and complex queries. Armed only with a pattern and a set of starting points, graph databases explore the larger neighborhood around the initial starting points — collecting and aggregating information from millions of nodes and relationships — leaving the billions outside the search perimeter untouched.

Below is a visual example:

![Graph Database](./images/graph.svg)
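
The idea of exploring only the neighbourhood around a starting point can be sketched with a plain adjacency list and a bounded breadth-first traversal. This is an in-memory illustration, not a graph database API. The node names and the `neighbourhood` function are hypothetical.

```python
from collections import deque

# Hypothetical graph: each node stores direct references to its neighbours.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
    "zed": ["yan"],  # unrelated region of the graph, never visited below
    "yan": [],
}


def neighbourhood(adjacency, start, max_hops):
    """Breadth-first traversal returning {node: hops} within max_hops of start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the search perimeter
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return seen


reachable = neighbourhood(graph, "alice", 2)
```

The traversal only touches nodes within two hops of `alice`, so `erin`, `zed`, and `yan` are never read. Its cost depends on the size of the neighbourhood explored and not on the total size of the graph.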