# Data engineering

By: Sherif Abdulkader Tawfik Abbas
- https://scholar.google.com/citations?user=NhT7vZIAAAAJ
- https://www.linkedin.com/in/sheriftawfik/

Topics we will cover:
- Introduction to data engineering
- Number cruncher: the `numpy` library
  - `numpy` arrays
  - Array operations
  - `numpy` mathematical functions
- Data work horse: `pandas`
  - The `pandas` `DataFrame` object
  - Loading and saving CSV files
- Text manipulation in python



## Introduction to data engineering

Machine learning uses data to train a model to perform a predictive task. The data that machine learning models can process is different from the data that a human can process. Humans can read text on a website and understand it based on the rules of grammar, as well as common sense. A machine needs the same information in a structured, numerical form. Here are two examples:


### Example 1: Numerical data

Let's say we want to pass the global temperature data into a machine learning model, to train it to predict the global temperature in 20 years' time. The data looks like this:

```
%                  Monthly          Annual          Five-year        Ten-year        Twenty-year
% Year, Month,  Anomaly, Unc.,   Anomaly, Unc.,   Anomaly, Unc.,   Anomaly, Unc.,   Anomaly, Unc.
 
  1751    12    -2.169  3.365    -1.153    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1752     1    -3.587  3.193    -1.157  1.358       NaN    NaN       NaN    NaN       NaN    NaN
  1752     2     1.556  4.577    -1.208  1.381       NaN    NaN       NaN    NaN       NaN    NaN
  1752     3    -0.292  2.431       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1752     4    -1.894  1.592       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1752     5    -0.422  1.465       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1752     6     0.115  1.426       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1752     7     0.760  1.319       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1752     8    -0.958  1.273       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1752     9       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
  1752    10    -1.639  1.640       NaN    NaN       NaN    NaN       NaN    NaN       NaN    NaN
```

The data can be obtained from [Berkeley Earth](http://berkeleyearth.lbl.gov/auto/Global/Complete_TAVG_complete.txt), and will be examined in Case Study 01 of this course. For every month (identified by the first two columns, year and month), the temperature anomaly (the difference in temperature from a reference average) is recorded, along with its uncertainty. But for some records, the temperature is `NaN`, which stands for "not a number". To pass the above table to a machine:

- First, we need to transform the above format into a format that the machine can process, such as the CSV format.
- Second, for `NaN` values, we should substitute them with numerical values. This is called "data imputation".
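
As a rough sketch of these two steps, the snippet below parses a few rows in the same whitespace-separated layout with `pandas` (introduced later in this tutorial), fills the missing values with the column mean, and produces CSV text. The column names and the choice of mean imputation are assumptions made for illustration; other imputation strategies may suit the real data better.

```python
import io

import pandas as pd

# A few rows in the same layout as the table above (first four columns only).
raw = """\
% Year, Month,  Anomaly, Unc.
  1752     8    -0.958  1.273
  1752     9       NaN    NaN
  1752    10    -1.639  1.640
"""

# Step 1: parse the whitespace-separated text, skipping the "%" comment line.
# The column names are illustrative, not taken from the data file.
df = pd.read_csv(
    io.StringIO(raw),
    sep=r"\s+",
    comment="%",
    header=None,
    names=["year", "month", "anomaly", "uncertainty"],
)

# Step 2: impute each missing value with the mean of its column.
filled = df.fillna(df.mean())

# The table is now in CSV form with no missing values.
csv_text = filled.to_csv(index=False)
print(csv_text)
```

Here the missing September anomaly is replaced by the average of the August and October values, because those are the only valid numbers in this small sample.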

### Example 2: Natural language processing
```
Foreign Affairs Minister Penny Wong has asked Optus to cover passport application fees for anyone caught up in last week's massive data breach, which affected millions of Australians.
```

How does the human brain comprehend the above sentence? It applies rules of syntax, or grammar. The first letter of "Penny" is upper-case, so we understand it is the name of a person. We know Optus is a company, not a person, so we won't be confused by "asked Optus". We know "caught up" means "involved". We have some idea about last week's security breach that hit the Optus servers. Finally, we know the sentence has ended because of the full stop.

Is that how a machine would understand the above text? The machine needs to know the structure of the sentence: whether it is a simple or a complex sentence, the clauses in the sentence, the objects (Penny, Optus, Australians, data breach) and the verbs (cover, affected).

However, the machine does not read the text as we do: the text first has to be translated into a numerical form, and the machine then processes the numbers. One example of a numerical form is to represent each pair of consecutive words by a number. For example:

| Two words | Number |
| --- | --- |
| has asked | 1 |
| caught up | 2 |
| which affected | 3 |

What if we find "which affects", "have asked", and "catch up"? "Have asked" means the same as "has asked"; the verb form differs only because of the word that comes before it. Given that they have the same meaning, it makes more sense to represent them with the same number.

Therefore, before translating the words and sentences into numbers, we need to make things less confusing to the machine:
- plurals, such as "fees" and "millions", become singular
- different forms of a verb, such as "have" and "has", should be represented as one verb, such as "have"
- we should remove commas, full stops, etc., because they are not part of the vocabulary 
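
The clean-up rules above, followed by the numbering of word pairs, can be sketched in plain python. The `BASE_FORMS` table and the helper names `normalize` and `pair_ids` are hypothetical and exist only for this illustration; a real project would normally rely on a dedicated natural language processing library for this step.

```python
import string

# Hypothetical, hand-written lookup table mapping words to a base form.
BASE_FORMS = {
    "fees": "fee",
    "millions": "million",
    "has": "have",
    "asked": "ask",
    "affects": "affect",
    "affected": "affect",
}


def normalize(text):
    """Remove punctuation, lower-case the text and map words to a base form."""
    no_punctuation = text.translate(str.maketrans("", "", string.punctuation))
    return [BASE_FORMS.get(word, word) for word in no_punctuation.lower().split()]


def pair_ids(words, vocabulary):
    """Give every pair of consecutive words a number, reusing known pairs."""
    ids = []
    for pair in zip(words, words[1:]):
        if pair not in vocabulary:
            vocabulary[pair] = len(vocabulary) + 1
        ids.append(vocabulary[pair])
    return ids


vocabulary = {}
first = pair_ids(normalize("Optus has asked."), vocabulary)
second = pair_ids(normalize("They have asked!"), vocabulary)
print(first, second)  # [1, 2] [3, 2]
```

After normalization, "has asked" and "have asked" both become the pair `("have", "ask")`, so they receive the same number.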

The modifications to the data in the above two examples are known as data preprocessing, or data pruning, or more generally data engineering.

Data engineering is the process of preparing data for machine learning. In python, it involves the use of several python modules, including `numpy` and `pandas`, to efficiently prepare the data.

In this class, you will learn how to use these modules to prepare numerical and text data.

## Number cruncher: the numpy library

`numpy` is one of the richest and most popular libraries in python. You can have a look at the wide range of mathematical functions in `numpy` here: https://numpy.org/doc/1.18/reference/index.html. `numpy` is particularly popular because of its data structure, the "numpy array", which allows flexibility in generating arrays and operating on them using mathematical functions.

Here, we will deal with two main features in `numpy`: the numpy array and mathematical functions.

### numpy arrays

`numpy` has a special collection type, the `numpy array`, which is far more suitable for performing numerical computations than python's default collection types. The important feature that distinguishes a `numpy array` from, say, a python `list` is the ability to perform element-wise operations.

Let's start by creating an array of 3 numbers: 4,6 and 7. 


In [None]:
import numpy as np
a=np.array([4,6,7])
print(a)

[4 6 7]


Note the syntax: the `array` function is expecting a python collection type, and here we pass a `list` type.

With python's lists, we find the length of the list using the `len()` function. `len()` also works on a `numpy array`, but we can also use the `shape` attribute, which gives the size of the array along each dimension:


In [None]:
import numpy as np
a=np.array([4,6,7])
print(a.shape)

(3,)


This means that we have a 1D array, with 3 elements. Let's create a 2D array.


In [None]:
import numpy as np
a=np.array([[4,6,7],[5,6,7]])
print(a.shape)

(2, 3)
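
For a 2D array, `len()` and `shape` no longer report the same thing. The short sketch below compares them, together with the related `ndim` and `size` attributes, which are not used elsewhere in this tutorial.

```python
import numpy as np

a = np.array([[4, 6, 7], [5, 6, 7]])

print(len(a))    # 2: the length of the first dimension (number of rows)
print(a.shape)   # (2, 3): 2 rows and 3 columns
print(a.ndim)    # 2: the number of dimensions
print(a.size)    # 6: the total number of elements
```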


To access the elements of a `numpy array`, we also use the bracket notation like we did with python's lists. For example, to get the first row of the above array:


In [None]:
import numpy as np
a=np.array([[4,6,7],[5,6,7]])
print(a[0])

[4 6 7]
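
Beyond whole rows, the bracket notation also accepts a row index and a column index separated by a comma, as well as slices. A small sketch with the same array:

```python
import numpy as np

a = np.array([[4, 6, 7], [5, 6, 7]])

element = a[1, 2]       # second row, third column
first_column = a[:, 0]  # every row, first column
row_tail = a[0, 1:]     # first row, from the second column onwards

print(element)       # 7
print(first_column)  # [4 5]
print(row_tail)      # [6 7]
```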



### Array operations

We can do a scalar multiplication of a `numpy array`:

In [None]:
import numpy as np
a=np.array([[4,6,7],[5,6,7]])
print(a*4)

[[16 24 28]
 [20 24 28]]
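
Scalar multiplication is a good place to see how a `numpy array` differs from a python `list`. In the sketch below, the same `*` operator repeats the list but multiplies every element of the array:

```python
import numpy as np

values = [4, 6, 7]

repeated = values * 2           # a list is repeated, not multiplied
doubled = np.array(values) * 2  # an array is multiplied element by element

print(repeated)          # [4, 6, 7, 4, 6, 7]
print(doubled.tolist())  # [8, 12, 14]
```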


And we can easily perform an element-wise multiplication of two arrays:



In [None]:
import numpy as np
a=np.array([4,6,7])
b=np.array([7,8,9])
print(a*b)

[28 48 63]


Element-wise operations include all of python's arithmetic operators.

In [None]:
import numpy as np
a=np.array([4,6,7])
b=np.array([7,8,9])
print(a*b,a/b,a+b,a-b)

[28 48 63] [0.57142857 0.75       0.77777778] [11 14 16] [-3 -2 -2]


What about matrix multiplication? For this, we use the `numpy` function `dot()`, which is available both as an array method and as `np.dot()`:



In [None]:
import numpy as np
a=np.array([[4,6,7],[0,3,2]])
b=np.array([[7,9],[4,10],[-4,6]])
print(a.dot(b))
print(np.dot(a, b))

[[ 24 138]
 [  4  42]]
[[ 24 138]
 [  4  42]]
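
Python also provides the `@` operator, which `numpy` arrays support for matrix multiplication. The sketch below checks that it gives the same result as `dot()` for the arrays above, and shows how the shapes combine: a (2, 3) array times a (3, 2) array gives a (2, 2) array.

```python
import numpy as np

a = np.array([[4, 6, 7], [0, 3, 2]])      # shape (2, 3)
b = np.array([[7, 9], [4, 10], [-4, 6]])  # shape (3, 2)

product = a @ b

print(product.shape)                          # (2, 2)
print(np.array_equal(product, np.dot(a, b)))  # True
```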


### numpy mathematical functions

`numpy` has a huge number of mathematical functions. For a quick overview:
- Statistical functions: https://numpy.org/doc/1.18/reference/routines.statistics.html
- Mathematical functions, including trigonometry and a lot more: https://numpy.org/doc/1.18/reference/routines.math.html

Using these functions with a `numpy array` is pretty straightforward. For example, let's compute the sum, average, maximum, minimum, standard deviation and variance of an array of numbers:

In [None]:
import numpy as np
a=np.array([1,5,4,7,5,3,5.7,3.4,2,6,8,7])
print(a.sum(),a.mean(),a.max(),a.min(),a.std(),a.var())

57.1 4.758333333333334 8.0 1.0 2.048356387177019 4.195763888888889


Note that these are some of the common descriptive statistics functions that we normally use when we start analyzing numerical data, before we perform machine learning computations on it.
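
These statistics interact with the `NaN` values from Example 1: a single `NaN` turns the ordinary mean into `NaN`. The sketch below uses five monthly anomalies from that table and `numpy`'s `nanmean()` function, which skips missing values. Filling the gap with that mean is just one possible imputation choice, picked here for illustration.

```python
import numpy as np

# Monthly anomalies for June to October 1752, with September missing.
a = np.array([0.115, 0.760, -0.958, np.nan, -1.639])

print(a.mean())       # nan, because one value is missing
print(np.nanmean(a))  # mean of the four valid numbers

# Replace each NaN with the mean of the valid values.
imputed = np.where(np.isnan(a), np.nanmean(a), a)
print(imputed)
```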

## Data work horse: `pandas`

Another data structure library is `pandas`, which is particularly suited for dealing with tabular data. It has quickly evolved over the years to become the global standard for representing and manipulating data in python.

Whereas `numpy` represents data as arrays, `pandas` represents data in the form of a more complex object called the `DataFrame`. Think of the `DataFrame` as a database table or a spreadsheet: it has a name (the variable name), fields with their titles, and possibly an index field (the primary key). DataFrames are particularly useful owing to their powerful querying and data transformation functionality.

Let's create a simple `DataFrame` that represents a table of 2 columns, `fruit` and `price`:

In [None]:
import pandas as pd
p = pd.DataFrame({'fruit':['apple','banana','pear'],'price':[4,3.5,2]})
print(p)

    fruit  price
0   apple    4.0
1  banana    3.5
2    pear    2.0


Printing a `DataFrame` will format it as an actual table, with an index (the first column that has no title) and fields, each with its designated title.

We can retrieve the second row by using the `loc` indexer, which selects rows by their index label:

In [None]:
p.loc[1]

fruit    banana
price       3.5
Name: 1, dtype: object
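
In the table above the index labels happen to coincide with the row positions. The sketch below uses fruit names as index labels instead (a variation introduced only for this illustration) to show the difference between `loc`, which selects by label, and `iloc`, which selects by position.

```python
import pandas as pd

# Same prices as above, but indexed by fruit name rather than 0, 1, 2.
prices = pd.DataFrame(
    {'price': [4, 3.5, 2]},
    index=['apple', 'banana', 'pear'],
)

by_label = prices.loc['banana', 'price']  # row labelled 'banana'
by_position = prices.iloc[1, 0]           # second row, first column

print(by_label, by_position)  # 3.5 3.5
```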

We can also query the `banana` record in this `DataFrame` by passing a condition to `loc`:


In [None]:
p.loc[p.fruit == 'banana']

    fruit  price
1  banana    3.5


Or we can find all records with prices higher than 2:

In [None]:
p.loc[p.price > 2]

    fruit  price
0   apple    4.0
1  banana    3.5
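
Conditions can also be combined, and `loc` accepts a column name after the row condition. A minimal sketch with the same table:

```python
import pandas as pd

p = pd.DataFrame({'fruit': ['apple', 'banana', 'pear'], 'price': [4, 3.5, 2]})

# Each condition needs its own parentheses; "&" means "and", "|" means "or".
selected = p.loc[(p.price > 2) & (p.fruit != 'apple'), 'fruit']

print(selected.tolist())  # ['banana']
```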


### Loading and saving CSV files

One of the very useful features of `pandas` is the loading and saving of CSV files. To save the above `DataFrame` in CSV format, we just do the following:

In [None]:
p.to_csv('fruits.csv')

To read the saved CSV file, it's also quite easy:

In [None]:
q = pd.read_csv('fruits.csv',index_col=0,header=0)
print(q)

    fruit  price
0   apple    4.0
1  banana    3.5
2    pear    2.0
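
The `index_col=0` argument is needed because `to_csv()` writes the index as the first, untitled column by default. The sketch below compares this with `index=False`, which leaves the index out. It works on CSV text in memory rather than on a file, purely to keep the example self-contained.

```python
import io

import pandas as pd

p = pd.DataFrame({'fruit': ['apple', 'banana', 'pear'], 'price': [4, 3.5, 2]})

# Without a file name, to_csv() returns the CSV text as a string.
with_index = p.to_csv()
without_index = p.to_csv(index=False)

print(with_index.splitlines()[0])     # ,fruit,price
print(without_index.splitlines()[0])  # fruit,price

# Without the index column, no index_col argument is needed when reading.
q = pd.read_csv(io.StringIO(without_index))
print(q)
```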


## Text manipulation in python

One of the factors that catalyzed the popularity of python has been its versatility in text manipulation.

Let's start with a few of python's string functions:

- `s.upper()` and `s.lower()` convert `s` to upper case and lower case.
- `s.strip()` removes leading and trailing whitespace from `s`.
- `s.split()` splits a sentence into words, using whitespace as the separator by default.
- `s.replace(a,b)` replaces all occurrences of `a` with `b` in the string `s`.
- `s.join()` joins all members of a collection into one string, separated by the string `s`.
- `s.isnumeric()` checks whether all characters in `s` are numeric.

In [None]:
a = 'this is some text'
print(a.upper(),a.lower())

a = '   lots of trailing spaces.    '
print(a)
print(a.strip())

print(a.split())

a = 'abcddefdd'
print(a.replace('d','x'))

print('abc'.join(['x','y','z']))

a = '456a'
print(a.isnumeric())

THIS IS SOME TEXT this is some text
   lots of trailing spaces.    
lots of trailing spaces.
['lots', 'of', 'trailing', 'spaces.']
abcxxefxx
xabcyabcz
False


These functions are very commonly used when we start *pre-processing* text in natural language processing, which is the *text cleaning* step.
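
As a small sketch of such a cleaning step, the snippet below turns a messy line of text into clean records. Note that `isnumeric()` only accepts digit-like characters, so it rejects strings such as `'3.5'` or `'-2'`; the hypothetical helper `to_number`, written for this example, uses `float()` instead. The input line and its `;` and `,` separators are made up for illustration.

```python
def to_number(token):
    """Return the token as a float, or None if it is not a number."""
    try:
        return float(token.strip())
    except ValueError:
        return None


print('456'.isnumeric())  # True
print('3.5'.isnumeric())  # False: the decimal point is not a numeric character
print('-2'.isnumeric())   # False: neither is the minus sign

line = '  Apple, 4 ; Banana,3.5 ; pear , two  '

records = []
for item in line.split(';'):
    name, price = item.split(',')
    records.append((name.strip().lower(), to_number(price)))

print(records)  # [('apple', 4.0), ('banana', 3.5), ('pear', None)]
```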

That's all for this tutorial. In the next tutorial, I will go through more specialized and detailed procedures for data pre-processing.