# Data Engineering

There are multiple definitions of a data engineer around the web, but at Dataquest we have our own that we believe best fits the role. For us, a data engineer's responsibility is to build the architecture for a data platform that enables data analysts, scientists, and other curious types to query their data without worry. Essentially, a data engineer needs the skills to build a data pipeline that connects all the pieces of the data ecosystem, and to keep that pipeline up and running.

Data engineering is the first -- and arguably most crucial -- step for a successful data strategy. Data engineers make sure data scientists have the data they need to perform data science. They're responsible for:
- Accessing, collecting, auditing, and cleaning data from applications and systems to bring it into a usable state.
- Creating and maintaining efficient databases.
- Building data pipelines.
- Monitoring and managing systems, including [distributed systems](https://www.computerhope.com/jargon/d/distribs.htm).

### Data Science Hierarchy of Needs
| Level | Description |
| :-- | :-- |
| Learn/Optimize | AI, Deep Learning |
| Aggregate/Label | A/B Testing, Experimentation, Simple ML Algorithms |
| Explore/Transform | Analytics, Metrics, Segments, Aggregates, Features, Training Data |
| Move/Store | Reliable Data Flow, Infrastructure, Pipelines, ETL, Structured and Unstructured Data Storage |
| Collect | Instrumentation, Logging, Sensors, External Data, User Generated Content |

Data engineers are responsible for the bottom two rows and also play a big role in the third row.

Data engineers' and data scientists' skills have some overlap at the boundaries of each role's knowledge, but they specialize in different things. For instance, given that data engineers are software engineers who work with data technologies, programming is the largest part of their training and workload, whereas data scientists ideally only program for analysis and predictions.

We'll start our journey by learning about programming. More specifically, we'll learn about programming in Python. This will be our main focus for this course and the next. In later courses, we'll build on this programming knowledge to learn how to create databases and develop other data engineering skills.

### Programming in Python

---
1. Instruct the computer to calculate `23 + 7`.

In [1]:
23 + 7

30

---
### The print() Command

On the previous screen, we instructed the computer to perform a single computation: `23 + 7`. However, we can ask the computer to perform more than just one computation.

---
1. Using the `print()` command, display the result for:
- `40 + 4`
- `200 - 25`
- `14 + 3`

In [2]:
print(40 + 4)
print(200 - 25)
print(14 + 3)

44
175
17


---
### Python Syntax

Previously, we sent the computer three instructions and wrote each on a separate line. If we were to put them all on the same line, we'd get an error.

Running `print(23 + 7) print(10 - 6) print(12 + 38)` produces red text describing a **syntax error**. This is because all programming languages -- Python included -- have syntax rules. Each line of instructions has to comply with these rules.
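
By contrast, the same three instructions run without error once each one sits on its own line. A minimal sketch, reusing the numbers from the failing one-line version:

```python
print(23 + 7)
print(10 - 6)
print(12 + 38)
```

Run this way, the program displays `30`, `4`, and `50`, each on its own line.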

---
1. Run the instructions below in the code editor. Remember that each instruction must be on a separate line.

In [3]:
print(30 + 10 + 40)
print(4)
print(-3)

80
4
-3


---
### Computer Programs

Let's get more practice with writing code.

---
1. By using the `print()` command, write a program that has three lines of code and:
- Displays the result of `34 + 16`.
- Displays the number `34`.
- Displays the number `-34`.
2. Run the program you wrote.
3. Submit Answer.

In [4]:
print(34 + 16)
print(34)
print(-34)

50
34
-34


---
### Code Comments

The computer executes code from the first line downwards and ignores blank lines.

Besides blank lines, the computer also ignores any sequence of characters that comes to the right of the `#` symbol on the same line. In the example below, we use `#` before `print(5 + 1)`, and we see the output of `print(5 + 1)` is not displayed anymore -- this is because `print(5 + 1)` is not executed when it's preceded by a `#`.

In [5]:
print(5 + 1)

6


In [6]:
# print(5 + 1)

The sequence of characters that follows the `#` symbol is called a **code comment**. We can also use code comments to add information about our code:

In [7]:
print(5 + 1)
print(8 - 7) # This is the line that outputs 1

6
1


Another way to use code comments is to add a general description at the beginning of our program.

In [8]:
# Test program
print(5 + 1)
print(8 - 7)

6
1


---
In the code editor on the right, we already added these three lines of code:

In [9]:
# print(34 + 16)
# print(34)
# print(-34)

1. Uncomment these three lines of code by removing the `#` symbols, and then click the Submit Answer button.

In [10]:
# INITIAL CODE
# print(34 + 16)
# print(34)
# print(-34)

In [11]:
# UNCOMMENTED CODE
print(34 + 16)
print(34)
print(-34)

50
34
-34


---
### Arithmetical Operations

Previously, we wrote programs that only performed additions and subtractions. We can also perform multiplication and division in Python. To perform multiplication, we need to use the `*` character. For instance, this is how we multiply `3` by `2`:

In [12]:
3 * 2

6

To perform division, we use the `/` character. This is how we divide `3` by `2`:

In [13]:
3 / 2

1.5
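
One detail worth noticing: in Python 3, the `/` operator gives a decimal result even when the numbers divide evenly. A small sketch, with numbers chosen only for illustration:

```python
# The result of / is displayed with a decimal point, even for an even split.
print(6 / 2)  # 3.0
print(7 / 2)  # 3.5
```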

We can also perform exponentiation (raising a number to a power) by using `**`. For example, this is how we can raise `4` to the power of `2` (in mathematical notation, we'd write this as 4^2).

In [14]:
4**2

16

The arithmetical operations we do in Python follow the usual order of operations we know from mathematics. Parentheses are calculated first, then exponentiation, then division and multiplication, and finally, addition and subtraction.

In [15]:
print(4 + 2 * 10)
print((4 + 2) * 10)

24
60


Looking at the code example above, we can deduce from the first operation (`4 + 2 * 10`) and its corresponding result (`24`) that multiplication precedes addition. However, for the second operation (`(4 + 2) * 10`), the addition is calculated first because this time it's surrounded by parentheses. Consequently, the result is `60`.
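
The other precedence rules can be checked the same way. The sketch below uses illustrative numbers that are not part of the original lesson: exponentiation runs before multiplication unless parentheses say otherwise, and division and multiplication, which share a level, are evaluated from left to right.

```python
# Exponentiation comes first: 3**2 is 9, then 2 * 9.
print(2 * 3**2)    # 18
# Parentheses come first: 2 * 3 is 6, then 6**2.
print((2 * 3)**2)  # 36
# Same level, left to right: 8 / 4 is 2.0, then 2.0 * 2.
print(8 / 4 * 2)   # 4.0
```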

So far we've used space characters between numbers and operators (`+`, `-`, `*`, `/`, `**` are operators). For instance, we've used `4 + 5` instead of `4+5`. But Python's syntax rules do not enforce this, so both `4 + 5` and `4+5` will run correctly. However, we encourage you to use spaces in your own code as this helps with readability.
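
As a quick check, both spellings of the same operation can be run side by side:

```python
print(4 + 5)  # spaced form, easier to read
print(4+5)    # unspaced form, same result
```

Both lines display `9`, so the spacing changes only how the code reads, not what it computes.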

---
1. Write a program with three lines of code that performs the following arithmetical operations and displays the results (using the `print()` command):
- 16 x 10
- 48 / 5
- 5^3

In [16]:
print(16 * 10)
print(48 / 5)
print(5**3)

160
9.6
125
