![Zühlke](./images/zuehlke_logo_rgb_small.jpg)

---

# Who am I and why am I here?

I'm Wolfgang and I work for Zühlke's Data Analytics Team...

...and we're always looking for data engineering talent.

I have been here before: Enterprise Computing and "Fast and Furious".

You (so I've been told) are curious to hear real-life stories.

It appears we have a deal!

---

# Resources:

### https://github.com/smurve/HSR2019

### https://github.com/Project-Ellie/home-in-time

---

# Data Engineering is Software Engineering

Data engineers write software that deals with data.

Data engineers are in high demand.

Data engineers sometimes get into ML, too!

Data engineer / ML engineer / Data scientist - ???

---

# Skills of a Data Engineer
- Knows traditional DBs and SQL well
- Applies data visualization
- Has a basic understanding of statistics
- Has a good idea (if not more) about ML
- Can write distributable, efficient code
- Wants to automate everything
- Is always security-aware (GDPR, etc.)

---

### [The hardest part of ML is not ML!](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)

<img src="images/ML_a_small_fraction.png" style="width: 700px;"/>


---

# What do you bring to the table already?
- Python?
- TensorFlow?
- TF 2.0 alpha?
- Lua, R, Julia, Torch, etc?
- Big data?
- Machine Learning?
- Deep Learning?

---

# Our project: "home in time"
### Predicting flight delays
https://github.com/Project-Ellie/home-in-time

We discuss the project and stray into related topics.

Hardly any subject is covered in depth.

Theoretical background (if any) is provided through references.

More in-depth material in additional Jupyter notebooks.

---

# Flight data from Atlanta
<img src="images/some-flight-data.png" style="width: 700px;"/>

---

# Predict flight delays - Really?
Flight delays are - unfortunately - unpredictable.

Still, there are patterns: weather, airline reliability...

But flight delays have a fat tail:

<img src="images/fat_tails.png" style="width: 600px;"/>
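
To make the fat tail concrete, here is a minimal sketch that compares the probability of an extreme delay under a normal model and under a Pareto model. All parameters (mean, standard deviation, Pareto scale and shape) are made-up illustrative assumptions, not values fitted to the Atlanta data.

```python
import math

def normal_tail(x, mean, std):
    """P(X > x) for a normal distribution."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2)))

def pareto_tail(x, scale, shape):
    """P(X > x) for a Pareto distribution (a classic fat-tailed model)."""
    return 1.0 if x <= scale else (scale / x) ** shape

# Illustrative, made-up parameters (all in minutes)
MEAN, STD = 10.0, 15.0
SCALE, SHAPE = 10.0, 2.0

for delay in (30, 60, 120, 240):
    print(f"P(delay > {delay:3d} min): "
          f"normal={normal_tail(delay, MEAN, STD):.2e}  "
          f"pareto={pareto_tail(delay, SCALE, SHAPE):.2e}")
```

Under these assumptions the normal model treats a four-hour delay as practically impossible, while the Pareto model still assigns it a small but real probability.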

---

# Predict flight delays - Really?

"Smart" prediction: display the probability distribution. [See *collateral*](https://github.com/smurve/HSR2019/blob/master/collateral/Fat_Tails.ipynb)


<img src="images/probability_distribution.png" style="width: 600px;"/>
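
One simple way to "display the probability distribution" is to bucket observed delays and report the share of flights in each bucket. The sketch below does this on a made-up sample; the bucket boundaries and the helper names `bucket_of` and `delay_distribution` are hypothetical and not taken from the project.

```python
from collections import Counter

# Hypothetical observed departure delays in minutes (made-up sample)
delays = [-3, 0, 2, 5, 7, 12, 14, 18, 25, 33, 47, 95]

# Hypothetical buckets: (label, exclusive upper bound in minutes)
BUCKETS = [("< 5", 5), ("5-15", 15), ("15-30", 30),
           ("30-60", 60), ("60+", float("inf"))]

def bucket_of(delay):
    """Return the label of the first bucket whose upper bound exceeds the delay."""
    for label, upper in BUCKETS:
        if delay < upper:
            return label

def delay_distribution(samples):
    """Empirical probability of each bucket, in bucket order."""
    counts = Counter(bucket_of(d) for d in samples)
    return {label: counts[label] / len(samples) for label, _ in BUCKETS}

distribution = delay_distribution(delays)
for label, prob in distribution.items():
    # Crude text histogram: one '#' per 2.5 percentage points
    print(f"{label:>6}: {prob:.2f} {'#' * round(prob * 40)}")
```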

---

# Data Exploration
- Play with billions of records?
- We need a fast analytical database.
- At any scale.
- We need SQL, still!
- Only a world-class cloud allows for (almost arbitrary) up-scaling.

---

# Analytical Databases
- Amazon Redshift
- Google BigQuery
- Azure Cosmos DB

# Architecture: 
- Multi-core/distributed query execution
- Append-only
- Weaker consistency guarantees

---

# Exploring Flight data (home-in-time)
[00_Data_Exploration](https://github.com/Project-Ellie/home-in-time/blob/master/00_Data_Exploration.ipynb)

<img src="images/Max_Dep_Delay.png" style="width: 500px"/>
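
The kind of exploratory query behind such a chart can be sketched locally. This example uses an in-memory SQLite database as a stand-in for a cloud analytical database; the table name, column names, and rows are made-up assumptions, not the project's actual schema.

```python
import sqlite3

# In-memory SQLite as a local stand-in for a cloud analytical database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (airline TEXT, origin TEXT, dep_delay REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("DL", "ATL", 12.0), ("DL", "ATL", 95.0), ("AA", "ATL", 3.0),
     ("AA", "ATL", 41.0), ("WN", "ATL", -2.0)],  # made-up rows
)

# Maximum departure delay per airline for flights leaving Atlanta
rows = conn.execute(
    """
    SELECT airline, COUNT(*) AS flights, MAX(dep_delay) AS max_dep_delay
    FROM flights
    WHERE origin = 'ATL'
    GROUP BY airline
    ORDER BY max_dep_delay DESC
    """
).fetchall()

for airline, n_flights, max_delay in rows:
    print(airline, n_flights, max_delay)
```

The same SQL shape carries over to billions of records; only the engine executing it changes.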

---

# Deployment Architecture

<img src="images/Deployment_Architecture.png" style="width: 600px;"/>

---

# Training and Prediction
<img src="images/Training_Prediction.png" style="width: 600px;"/>

---

# Training data
- Some models require millions or even billions of training records
- Training data needs to be
  - collected
  - cleansed
  - re-formatted
  - aggregated
  - preprocessed
  - combined from different sources
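
A few of these steps can be sketched in plain Python on a tiny made-up data set. The CSV layout, the weather lookup, and the helper names `prepare` and `mean_delay_by` are assumptions for illustration only.

```python
import csv
import io
from collections import defaultdict

# Made-up raw inputs from two hypothetical sources
FLIGHTS_CSV = """date,airline,dep_delay
2019-01-01,DL,12
2019-01-01,AA,
2019-01-02,DL,95
2019-01-02,AA,7
"""
WEATHER = {"2019-01-01": "clear", "2019-01-02": "storm"}

def prepare(flights_csv, weather):
    """Collect, cleanse, re-format and combine raw rows into training records."""
    records = []
    for row in csv.DictReader(io.StringIO(flights_csv)):     # collect
        if not row["dep_delay"]:                              # cleanse: drop incomplete rows
            continue
        records.append({
            "airline": row["airline"],
            "dep_delay": float(row["dep_delay"]),             # re-format: text -> number
            "weather": weather.get(row["date"], "unknown"),   # combine sources
        })
    return records

def mean_delay_by(records, key):
    """Aggregate: mean departure delay per distinct value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["dep_delay"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

records = prepare(FLIGHTS_CSV, WEATHER)
print(mean_delay_by(records, "weather"))
```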

---

# Signature and Training Stage
- Reproduce all pre-processing steps during prediction!
- Failing to do so leads to "training-serving skew"

<img src="images/Signature_vs_Training.png" style="width: 500px"/>
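
One common way to avoid training-serving skew is to route both paths through a single preprocessing function. The sketch below illustrates the idea; the feature names, the stored statistics, and the function names are hypothetical and not the project's actual code.

```python
import math

# Hypothetical statistics, computed once on the training set and stored with the model
TRAINING_STATS = {"mean_distance": 800.0, "std_distance": 400.0}

def preprocess(raw, stats=TRAINING_STATS):
    """Single source of truth for feature engineering, used by training AND serving."""
    hour = int(raw["dep_time"].split(":")[0])
    return {
        "distance_scaled": (raw["distance"] - stats["mean_distance"]) / stats["std_distance"],
        # Encode the hour cyclically so that 23:00 and 00:00 end up close together
        "hour_sin": math.sin(2 * math.pi * hour / 24),
        "hour_cos": math.cos(2 * math.pi * hour / 24),
    }

def training_features(rows):
    """Training path: preprocess every historical row."""
    return [preprocess(r) for r in rows]

def serving_features(request):
    """Serving path: preprocess one incoming request with the very same function."""
    return preprocess(request)

row = {"distance": 1200.0, "dep_time": "18:30"}
# Both paths produce identical features for identical input
assert training_features([row])[0] == serving_features(row)
```

If the serving path re-implemented the scaling with different statistics, the model would see inputs it was never trained on, which is exactly the skew described above.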

---

# Fast Data Processing with Beam Pipelines
- Apache Beam is a de facto standard
- Supports real-time and batch processing with the same code
- Programming model: directed acyclic graphs
- Test execution locally on any machine
- Production-scale parallel execution on a cluster
- Map/Reduce/Shuffle automatically optimized
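
The map, shuffle, and reduce stages that a runner plans for you can be illustrated without any framework. The following is plain Python, not Beam API, and the records and function names are made up; it only shows what each stage contributes.

```python
from collections import defaultdict

# Made-up (airline, departure delay in minutes) records
flights = [("DL", 12), ("AA", 3), ("DL", 95), ("WN", -2), ("AA", 41)]

def map_stage(records):
    """Map: emit one (key, value) pair per record, each record independently."""
    for airline, delay in records:
        yield airline, max(delay, 0)  # treat early departures as zero delay

def shuffle_stage(pairs):
    """Shuffle: bring all values that share a key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_stage(grouped):
    """Reduce: collapse each key's values into a single result (here: the mean)."""
    return {key: sum(values) / len(values) for key, values in grouped.items()}

mean_delay = reduce_stage(shuffle_stage(map_stage(flights)))
print(mean_delay)
```

Because the map stage handles each record independently and the reduce stage handles each key independently, both can be spread across many workers; only the shuffle needs to move data between them.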


---

# Programming a pipeline 

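```python
import apache_beam as beam
import tensorflow_transform as tft
from apache_beam.options.pipeline_options import PipelineOptions

# ORDERED_TRAINING_COLUMNS and TRAINING_METADATA are defined elsewhere in the project.
with beam.Pipeline('DirectRunner', PipelineOptions()) as p:

    # Coder that translates between CSV lines and records of the training schema
    csv_encoder = tft.coders.CsvCoder(ORDERED_TRAINING_COLUMNS, TRAINING_METADATA.schema)

    _ = (p
         | 'read_from_csv' >> beam.io.ReadFromText(
             file_pattern='../testdata/test.csv', coder=csv_encoder)

         | 'write_to_csv' >> beam.io.WriteToText(
             file_path_prefix='./out.csv', coder=csv_encoder)
        )
```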


---

# A Production Beam Pipeline in action
<img src="images/Dataflow.png" style="width: 600px"/>

---

# Fodder for the Model
See: [Input Functions](https://github.com/Project-Ellie/home-in-time/blob/master/03_Input_Functions.ipynb)
- Process any number of files
- Create a continuous stream of decoded records
- Repeat the data stream (epochs)
- Shuffle the data to stabilize learning
- Split the data into efficient batch sizes
- Automatically iterate over those batches
- Prefetch data, use multiple threads in parallel
- Distribute the data stream if possible
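
The read, repeat, shuffle, and batch steps above can be sketched with plain Python generators. This is a stand-in that does not use TensorFlow; the "files" are in-memory lists, all function names are hypothetical, and prefetching, threading, and distribution are left out.

```python
import random
from itertools import islice

def read_records(files):
    """Stream decoded records from any number of 'files' (here: lists of strings)."""
    for lines in files:
        for line in lines:
            yield int(line)

def repeat(make_stream, epochs):
    """Replay the whole stream once per epoch."""
    for _ in range(epochs):
        yield from make_stream()

def shuffle(stream, buffer_size, rng):
    """Shuffle with a bounded buffer, so the full data set never sits in memory."""
    buffer = []
    for record in stream:
        buffer.append(record)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    rng.shuffle(buffer)  # drain what is left at the end of the stream
    yield from buffer

def batch(stream, size):
    """Group records into lists of at most `size` elements."""
    stream = iter(stream)
    while chunk := list(islice(stream, size)):
        yield chunk

files = [["1", "2", "3"], ["4", "5"]]  # made-up file contents
pipeline = batch(
    shuffle(repeat(lambda: read_records(files), epochs=2),
            buffer_size=4, rng=random.Random(42)),
    size=4,
)
batches = list(pipeline)
print(batches)
```

Each stage only pulls records when the next stage asks for them, so the chain works on data sets far larger than memory.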

---

# TensorFlow

Fundamental concepts: Directed Graphs and Sessions

Hardware abstraction and optimal use of GPU/TPU resources

Distributable without code change

Fully-featured DL Library

We'll learn to use TensorFlow in the exercises
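
The split between building a graph and running it in a session can be illustrated with a toy implementation. The classes `Node` and `Session` and the function `placeholder` below are hypothetical teaching aids that borrow the concept names; they are not TensorFlow's API.

```python
class Node:
    """A node in a toy computation graph: an operation plus its input nodes."""
    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, inputs, name

    def __add__(self, other):
        return Node(lambda a, b: a + b, (self, other))

    def __mul__(self, other):
        return Node(lambda a, b: a * b, (self, other))

def placeholder(name):
    """A graph input whose value is supplied only at run time."""
    return Node(None, name=name)

class Session:
    """Evaluates a node by recursively evaluating its inputs."""
    def run(self, node, feed_dict):
        if node.op is None:  # placeholder: look up the fed value
            return feed_dict[node.name]
        return node.op(*(self.run(i, feed_dict) for i in node.inputs))

# Building the graph computes nothing yet
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
y = x * w + b

# Values flow only when the session runs the graph
result = Session().run(y, {"x": 3.0, "w": 2.0, "b": 1.0})
print(result)  # 7.0
```

Because the graph exists as data before anything runs, a real framework can inspect it, optimize it, and place its parts on different devices.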

---

<img src="images/TF_programming_model.png" style="width: 700px"/>

---

# Exercises
[TensorFlow introduction](https://github.com/smurve/HSR2019/blob/master/exercises/TF_Introduction.ipynb)

---