# 01 - Introduction to Big Data

## Introduction to Big Data

### Gartner's 3Vs

- **V**olume
- **V**ariety
- **V**elocity

---

### Some history

- 2003 - Google File System (GFS)
- 2004 - Google’s MapReduce
- 2006 - Hadoop

*Notes*

*In particular, cover the lineage of distributed file systems: GFS (Google), then NDFS (Nutch), then HDFS (Hadoop).*

*Links*

*See [Google's paper on GFS](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/035fc972c796d33122033a0614bc94cff1527999.pdf) and the excellent [History of Hadoop](https://medium.com/@markobonaci/the-history-of-hadoop-68984a11704).*

---

### Requirements of Distributed File Systems

- **schemaless**: no predefined structure, i.e. no rigid schema with tables and columns (and column types and sizes)
- **durable**: once data is written, it should never be lost
- **capable of handling component failure** without human intervention (e.g. CPU, disk, memory, network, power supply, motherboard)
- **automatically rebalanced** to even out disk space consumption across the cluster

### Distributed processing

**Vertical scaling vs horizontal scaling**

![](images/vertical_vs_horizontal_scaling-f494091d-dcb2-4586-82cd-dbc0a9d239fc.png)

Source: [https://blog.turbonomic.com/blog/on-technology/the-essentials-of-database-scalability-vertical-horizontal](https://blog.turbonomic.com/blog/on-technology/the-essentials-of-database-scalability-vertical-horizontal)

*Notes*

***Vertical scaling:** adding more power (CPU, RAM) to an existing machine.*

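*It gets expensive quickly and is capped by how much hardware a single machine can hold; single-machine performance improves, at best, at the pace of Moore's Law.*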

***Horizontal scaling:** adding more machines into your pool of resources and connecting them into a cluster.*

*This allows the use of commodity hardware, i.e. you don't need supercomputers.*

---

**When do we need distributed processing**

- Data won't fit in the memory of a single machine
- Compute can be chunked into small pieces and parallelized (either analytics, ETL or modeling)
- We want to run multiple computations in parallel, for example, testing different hyperparameters of a machine learning model
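
The second point can be illustrated on a single machine. The following is a minimal sketch, not a cluster implementation: the function names (`chunked`, `sum_of_squares`, `parallel_sum_of_squares`) and the workload are hypothetical, and a thread pool stands in for the worker machines of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def sum_of_squares(chunk):
    """Work done independently on one chunk (hypothetical workload)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, chunk_size=1000, workers=4):
    # Threads stand in for cluster workers here; on a real cluster each
    # chunk would be shipped to a different machine.
    chunks = chunked(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    # Combine the partial results into the final answer.
    return sum(partials)


data = list(range(10_000))
total = parallel_sum_of_squares(data)
```

The pattern only works because each chunk can be processed without looking at any other chunk; the final `sum` is the only step that needs all partial results.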

---

**Distributed computing is hard**

> In a distributed system, anything that can go wrong, will go wrong.

*Notes*

*Failure is not an option, **failure is expected**.*

*When you're running a cluster of thousands of machines, the probability that at least one will fail is actually very high. Without safeguards, a single task that dies or runs slowly can fail or stall the whole job. You need protection against this. (Spark's RDDs, for example, can recompute lost partitions from their lineage.)*
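
To put a number on that claim, here is a small sketch. It assumes machines fail independently and uses a made-up per-machine failure probability of 0.1% over some time window; both the independence assumption and the figure are illustrative, not measurements.

```python
def prob_at_least_one_failure(n_machines, p_fail):
    """P(at least one of n independent machines fails) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p_fail) ** n_machines


# Hypothetical per-machine failure probability for the chosen time window.
P_FAIL = 0.001

for n in (1, 10, 100, 1000, 10000):
    print(n, round(prob_at_least_one_failure(n, P_FAIL), 4))
```

Under these assumptions a single machine almost never fails, but with 1,000 machines the chance of at least one failure is already roughly 63%.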

*Links*

- Berkeley CS61a [Chapter 4: Distributed and Parallel Computing](http://wla.berkeley.edu/~cs61a/fa11/lectures/communication.html#distributed-computing)
- [Byzantine Generals problem](https://medium.com/all-things-ledger/the-byzantine-generals-problem-168553f31480)
- [8 fallacies of distributed computing](https://www.youtube.com/watch?v=Q4p-2WIS0nQ)

---

**MapReduce**

![](images/map_reduce_word_count_process-ed4d1e0b-1180-4609-88e1-e2b1054829e7.png)

Source: [https://www.oreilly.com/library/view/distributed-computing-in/9781787126992/5fef6ce5-20d7-4d7c-93eb-7e669d48c2b4.xhtml](https://www.oreilly.com/library/view/distributed-computing-in/9781787126992/5fef6ce5-20d7-4d7c-93eb-7e669d48c2b4.xhtml)

*Notes*

*MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. (Wikipedia)*

***Map**: each worker node applies the map function to the local data*

***Shuffle**: worker nodes redistribute data based on the output keys (produced by the previous map step), such that all data belonging to one key is located on the same worker node*

***Reduce**: worker nodes process each group of data, per key, in parallel*

*The key contributions of the MapReduce framework are not the actual map and reduce functions but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine. The model pays off only when the framework's optimized distributed shuffle (which reduces network communication cost) and its fault-tolerance features come into play.*
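
The three phases can be sketched in plain Python with the classic word count. This is a single-process illustration of the programming model only, not Hadoop's API: the function names and the input documents are made up, and none of the distribution or fault tolerance described above is present.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all values by key so each key ends up in one place."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster, each document (or split) is mapped on a different node.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    # Each key could be reduced on a different node, in parallel.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
counts = word_count(documents)
```

Here `counts` maps each lowercased word to its total, for example `car` to 3. In a real framework the shuffle moves data over the network, which is why it is the step the execution engine works hardest to optimize.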

---

### Apache Hadoop

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/664px-Hadoop_logo-381ff611-6909-4d30-bbd7-68b6a3257526.png)

*Notes*

*A popular open-source implementation of the MapReduce paradigm, plus a distributed file system (HDFS) and other components such as a resource manager (YARN).*

---