# Data preparation

## Library `numpy`

The [`numpy`](https://numpy.org/) library provides numerical computing in Python. It contains efficient implementations of data structures such as vectors, matrices, and multi-dimensional arrays. All of them are instances of a single data type, $\texttt{ndarray}$, which is created with the function $\texttt{array}$.
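
The cells in this section refer to the library as `np`, the conventional alias. The import cell itself is not shown here, so the following is a minimal sketch of the import the examples presumably rely on:

```python
import numpy as np

# A small array to confirm that the alias works
a = np.array([1.0, 2.0, 3.0])
kind = type(a).__name__   # name of the underlying type
print(kind, a.shape)
```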

### Creating arrays
We can create an array in different ways:

* by converting Python lists or tuples,
* using the functions $\texttt{arange}$, $\texttt{linspace}$, and the like,
* by reading data from files.

#### Conversion of lists into multi-dimensional arrays

We use the constructor $\texttt{array}$ directly by passing it a list.
If we give a list of numbers, we get a vector:

In [2]:
v = np.array([1, 2, 3, 4])
v

array([1, 2, 3, 4])

If we give a list of lists, we get a matrix:

In [3]:
M = np.array([[1, 2], [3, 4]])
M

array([[1, 2],
       [3, 4]])

Regardless of the shape, the objects $\texttt{v}$ and $\texttt{M}$ are of type $\texttt{ndarray}$.

In [4]:
type(v), type(M)

(numpy.ndarray, numpy.ndarray)

The difference is in their dimensions. The object $\texttt{v}$ is a vector with four elements, and $\texttt{M}$ is a `2 x 2` matrix.

In [5]:
v.shape

(4,)

In [6]:
M.shape

(2, 2)

Similarly, the attribute `size` gives the number of elements in the entire array.

In [7]:
M.size

4

##### Question 1-1-1

We can compose arrays of any dimension. Try to create a list of lists (of lists, ...) and check out what its dimensions are!

In [8]:
X = np.array([[1,2],[3,4]])
X.shape

(2, 2)

[Answer](201-1.ipynb#Answer-1-1-1)

#### Functions for creating arrays

The `numpy` library contains functions for generating common array types. Let's look at some examples.

**The `arange` range**

The `arange` function returns an array with evenly spaced values within a given interval. 
```python
np.arange([start, ]stop, [step])
```
* `start`: optional, the first value of the array. Default is 0
* `stop`: end of the interval, not included
* `step`: optional, spacing between values. Default is 1

In [9]:
np.arange(0, 10, 1)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
np.arange(-1, 1, 0.1)

array([-1.00000000e+00, -9.00000000e-01, -8.00000000e-01, -7.00000000e-01,
       -6.00000000e-01, -5.00000000e-01, -4.00000000e-01, -3.00000000e-01,
       -2.00000000e-01, -1.00000000e-01, -2.22044605e-16,  1.00000000e-01,
        2.00000000e-01,  3.00000000e-01,  4.00000000e-01,  5.00000000e-01,
        6.00000000e-01,  7.00000000e-01,  8.00000000e-01,  9.00000000e-01])
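
The value `-2.22044605e-16` where we would expect `0` is a round-off artifact of the non-integer step `0.1`, which cannot be represented exactly in binary floating point. One possible way around it, shown here only as a sketch, is to generate integers and scale them afterwards:

```python
import numpy as np

direct = np.arange(-1, 1, 0.1)    # non-integer step, subject to round-off
scaled = np.arange(-10, 10) / 10  # integer steps, divided afterwards

print(direct[10], scaled[10])
```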

**Ranges `linspace` and `logspace`**

The `linspace` and `logspace` functions also return an array of values within a given interval: `linspace` spaces them evenly, while `logspace` spaces them evenly on a logarithmic scale. Instead of the step, we specify the number of values in the array.
```python
np.linspace(start, stop, [num])
np.logspace(start, stop, [num])
```
* `start`: the first value of the array (for `logspace`, the exponent of the first value)
* `stop`: end of the interval, **included** (for `logspace`, the exponent of the last value)
* `num`: optional, the number of values in the array. Default is 50

The `logspace` function allows specifying the logarithmic base using the `base` parameter.

In [11]:
np.linspace(0, 10, 25)

array([ 0.        ,  0.41666667,  0.83333333,  1.25      ,  1.66666667,
        2.08333333,  2.5       ,  2.91666667,  3.33333333,  3.75      ,
        4.16666667,  4.58333333,  5.        ,  5.41666667,  5.83333333,
        6.25      ,  6.66666667,  7.08333333,  7.5       ,  7.91666667,
        8.33333333,  8.75      ,  9.16666667,  9.58333333, 10.        ])

In [12]:
np.logspace(0, 10, 11, base=np.e)

array([1.00000000e+00, 2.71828183e+00, 7.38905610e+00, 2.00855369e+01,
       5.45981500e+01, 1.48413159e+02, 4.03428793e+02, 1.09663316e+03,
       2.98095799e+03, 8.10308393e+03, 2.20264658e+04])
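
To make the relation between the two functions concrete: `logspace(start, stop, num, base=b)` gives the same values as raising `b` to the powers returned by `linspace(start, stop, num)`. A short check of this, with names chosen only for the illustration:

```python
import numpy as np

exponents = np.linspace(0, 10, 11)          # 0, 1, ..., 10
by_hand = np.e ** exponents                 # e**0, e**1, ..., e**10
direct = np.logspace(0, 10, 11, base=np.e)  # the same values in one call

same = np.allclose(by_hand, direct)
print(same)
```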

**Zeroes and ones - `zeros`, `ones`**

Both functions return an array of a given shape, filled with zeros or ones.

In [13]:
np.zeros((3, 4))

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [14]:
np.ones((4, 3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

More generally, we can use `full` to create an array filled with any single value.

In [15]:
np.full((4,5), 7)

array([[7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7],
       [7, 7, 7, 7, 7]])
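
Note that `zeros` and `ones` return floating-point values by default, while `full` takes the type of the given fill value. As a sketch, the optional `dtype` parameter can be used to choose the element type explicitly:

```python
import numpy as np

z = np.zeros((2, 3), dtype=int)      # integer zeros instead of floats
f = np.full((2, 2), 7, dtype=float)  # float sevens instead of integers
print(z.dtype, f.dtype)
```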

**Random number generator**

We construct a random number generator with the function `default_rng([seed])`. The optional seed makes the results reproducible across executions.

In [16]:
rng = np.random.default_rng(42)
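
A brief sketch of what the seed guarantees: two generators created with the same seed produce the same sequence. The names `rng_a` and `rng_b` are used only for this illustration.

```python
import numpy as np

rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

draw_a = rng_a.integers(20, size=5)
draw_b = rng_b.integers(20, size=5)

same = np.array_equal(draw_a, draw_b)  # identical seeds, identical draws
print(same)
```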

The method `integers` draws uniformly distributed integers from a specified interval (upper bound excluded), optionally as an array of a specified size.

In [17]:
rng.integers(20)

1

In [18]:
rng.integers(20, size=20)

array([15, 13,  8,  8, 17,  1, 13,  4,  1, 10, 19, 14, 15, 14, 15, 10,  2,
       16,  9, 10], dtype=int64)

In [19]:
rng.integers(10, 20, size=(2,5), dtype=np.uint8)

array([[15, 16, 19, 13, 16],
       [15, 17, 11, 13, 14]], dtype=uint8)

Uniformly distributed values in the half-open interval [0,1):

In [20]:
rng.random()

0.6438651200806645

In [21]:
rng.random((2,5))

array([[0.82276161, 0.4434142 , 0.22723872, 0.55458479, 0.06381726],
       [0.82763117, 0.6316644 , 0.75808774, 0.35452597, 0.97069802]])

Normally distributed values with mean 0 and variance 1:

In [22]:
rng.normal(size=(2, 5))

array([[-0.15452948, -0.42832782, -0.35213355,  0.53230919,  0.36544406],
       [ 0.41273261,  0.430821  ,  2.1416476 , -0.40641502, -0.51224273]])

##### Question 1-1-2

The `normal` method assumes by default the mean $\mu = 0$ and the standard deviation $\sigma = 1$. How can we model an arbitrary mean and standard deviation, e.g. $\mu=5$ and $\sigma=0.5$?

[Answer](201-1.ipynb#Answer-1-1-2)

#### Loading data from files

NumPy provides convenient functions to load data from files. Commonly used functions include `loadtxt()` and `genfromtxt()`, which can handle files with numerical data. These functions allow specifying delimiters, data types, and `genfromtxt()` can also handle missing values.

In [23]:
data = np.loadtxt('../data/stockholm.csv', delimiter=",", skiprows=1)
data

array([[ 1.756e+03,  1.000e+00,  1.000e+00, -8.700e+00],
       [ 1.756e+03,  1.000e+00,  2.000e+00, -9.200e+00],
       [ 1.756e+03,  1.000e+00,  3.000e+00, -8.600e+00],
       ...,
       [ 2.025e+03,  6.000e+00,  1.300e+01,  1.590e+01],
       [ 2.025e+03,  6.000e+00,  1.400e+01,  1.880e+01],
       [ 2.025e+03,  6.000e+00,  1.500e+01,  2.150e+01]])
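
The following sketch illustrates how `genfromtxt()` treats a missing value. It reads from an in-memory string rather than a file; the contents are made up for the illustration and are not taken from `stockholm.csv`.

```python
import io
import numpy as np

# Hypothetical CSV content; the last row lacks a temperature
text = "year,month,day,temperature\n2000,1,1,-1.5\n2000,1,2,\n"

sample = np.genfromtxt(io.StringIO(text), delimiter=",", skip_header=1)
print(sample)
```

The missing entry is read as `nan`, so the array keeps its rectangular shape.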

### Differences between lists and arrays

The structure `numpy.ndarray` still looks like a list of lists (of lists ...). What's the difference?

Some quick facts:
* Method of addressing values
* Typing
    * Python lists can contain any type of object that can vary within the list (**dynamic typing**). 
    * Arrays are **statically typed** and **homogeneous**. The data type of elements is determined at the time of creation.
* Lists do not support element-wise mathematical operations; implementing them in pure Python would be very inefficient. Most computational operations on arrays are implemented in lower-level languages (Fortran, C).

As a result, arrays are memory-efficient, since they occupy a fixed space in memory.
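
A small sketch of these differences. With a list, the operator `*` repeats the contents, while with an array it multiplies each element; an array also reports one fixed element type and size.

```python
import numpy as np

numbers = [1, 2, 3]
arr = np.array(numbers)

repeated = numbers * 2   # list repetition: [1, 2, 3, 1, 2, 3]
doubled = arr * 2        # element-wise multiplication: array([2, 4, 6])

mixed = [1, "two", 3.0]  # a list may mix element types
total_bytes = arr.itemsize * arr.size  # fixed-size elements; equals arr.nbytes
print(repeated, doubled, arr.dtype, total_bytes)
```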

#### Addressing

Elements are addressed using square brackets, similar to lists.

`v` is a vector; we address it by its only dimension.

In [24]:
v = np.array([1, 2, 3, 4, 5])
v[0]

1

We need two indices to address the matrix `data`; the address is now a tuple.

In [25]:
data[1,1]

1.0

Addressing only the first dimension returns a whole row.

In [26]:
data[1]

array([ 1.756e+03,  1.000e+00,  2.000e+00, -9.200e+00])

By using `:` we say that we want all elements in the corresponding dimension. A row:

In [27]:
data[1, :]

array([ 1.756e+03,  1.000e+00,  2.000e+00, -9.200e+00])

How would we access an entire column of a list of lists? We would need a `for` loop. The addressing syntax simplifies this substantially; for example, the column with index 1 (the months):

In [28]:
data[:, 1]

array([1., 1., 1., ..., 6., 6., 6.])
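
For comparison, a sketch of the same column access on a plain list of lists. The three rows are copied from the output of `loadtxt` above; the name `rows` is introduced only here.

```python
import numpy as np

rows = [
    [1756, 1, 1, -8.7],
    [1756, 1, 2, -9.2],
    [1756, 1, 3, -8.6],
]

column_from_list = [row[1] for row in rows]  # explicit loop over the rows
column_from_array = np.array(rows)[:, 1]     # a single indexing expression
print(column_from_list, column_from_array)
```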

Individual elements can be changed with assignment statements.

In [29]:
M = np.array([[1, 2], [3, 4]])
M[0, 0] = 9
M

array([[9, 2],
       [3, 4]])

We can also set a whole row or column at once.

In [30]:
M[1, :] = 0
M[:, 1] = -1
M

array([[ 9, -1],
       [ 0, -1]])

We can assign a different value to each element in the selected dimension, but we must be careful about the size of the dimension.

In [31]:
M[0] = [2, 3]
M

array([[ 2,  3],
       [ 0, -1]])

In [32]:
M[:, 0] = [2, 3, 4]

ValueError: could not broadcast input array from shape (3,) into shape (2,)

##### Cutting
Cutting (slicing) arrays is a common operation. An arbitrary sub-array is obtained by addressing `M[start:stop:step]`.

* `start`: starting address of the array. Default value is 0
* `stop`: final address of the array, not included. Default value is the length of the array
* `step`: size of the step. Default value is 1

In [33]:
A = np.array([1, 2, 3, 4, 5])
A

array([1, 2, 3, 4, 5])

In [34]:
A[1:3]

array([2, 3])

We can also change the addressed sub-arrays.

In [35]:
A[1:3] = [-2, -3]
A

array([ 1, -2, -3,  4,  5])
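
A related point, as a brief sketch: a sub-array obtained by cutting is a view on the original data, not a copy, so changes made through it are visible in the original array. An independent copy can be requested with `copy()`. The names `base`, `view` and `independent` are used only for this illustration.

```python
import numpy as np

base = np.array([1, 2, 3, 4, 5])

view = base[1:3]        # shares memory with base
view[0] = 99            # also changes base[1]

independent = base[1:3].copy()
independent[0] = -1     # base is not affected
print(base)
```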

Any of the cutting parameters may also be omitted. The default values are used.

In [36]:
A[::]

array([ 1, -2, -3,  4,  5])

Every other element:

In [37]:
A[::2]

array([ 1, -3,  5])

The first three elements:

In [38]:
A[:3]

array([ 1, -2, -3])

From the element with index 3 onwards:

In [39]:
A[3:]

array([4, 5])

Negative indices count from the *end* of the array:

In [40]:
A[-1]

5

The last three elements:

In [41]:
A[-3:]

array([-3,  4,  5])

We can easily reverse an array:

In [42]:
A[-1::-1]

array([ 5,  4, -3, -2,  1])

Cutting also works in multi-dimensional arrays.

In [43]:
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])
A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [44]:
A[1:4, 1:4]

array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

Elements can be skipped.

In [45]:
A[::2, ::2]

array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])

##### Addressing arrays using a second structure
Arrays can also be addressed using other arrays or lists.

In [46]:
row_indices = [1, 2, 3]
A[row_indices]

array([[10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34]])

In [47]:
col_indices = [1, 2, -1]
A[row_indices, col_indices]

array([11, 22, 34])
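
Note that passing two index lists selects the element pairs `(1, 1)`, `(2, 2)` and `(3, -1)`, not a 3 x 3 block. If the block of all row and column combinations is wanted, one option is the helper `np.ix_`, sketched here:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
row_indices = [1, 2, 3]
col_indices = [1, 2, -1]

paired = A[row_indices, col_indices]         # one element per index pair
block = A[np.ix_(row_indices, col_indices)]  # every row/column combination
print(paired)
print(block)
```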

We can also use *masks*. These are structures with `bool` data indicating whether or not the element in the corresponding location will be selected.

In [48]:
B = np.array([n for n in range(5)])
B

array([0, 1, 2, 3, 4])

In [49]:
row_mask = np.array([True, False, True, False, False])
B[row_mask]

array([0, 2])

A slightly different way of defining the mask:

In [50]:
row_mask = np.array([1, 0, 1, 0, 0], dtype=bool)
B[row_mask]

array([0, 2])

This method can be used to conditionally address elements according to their content.

In [51]:
x = np.array([0, 4, 2, 2, 3, 7, 10, 12, 15, 28])
x

array([ 0,  4,  2,  2,  3,  7, 10, 12, 15, 28])

In [52]:
mask = (5 < x) & (x < 12.3)
mask

array([False, False, False, False, False,  True,  True,  True, False,
       False])

In [53]:
x[mask]

array([ 7, 10, 12])
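
Boolean masks can be combined with the element-wise operators `&` (and), `|` (or) and `~` (not). A short sketch on the same values; the parentheses are needed because these operators bind more tightly than comparisons.

```python
import numpy as np

x = np.array([0, 4, 2, 2, 3, 7, 10, 12, 15, 28])

inside = (x > 5) & (x < 12.3)  # both conditions hold
outside = ~inside              # negation of the mask
extremes = (x < 3) | (x > 12)  # at least one condition holds

print(x[inside], x[outside], x[extremes])
```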

##### Question 1-1-3

Test combinations of all the addressing methods mentioned so far. For example, address rows with cutting and columns with conditional addressing at the same time. Create a structure with more than two dimensions. Make sure you understand the result of each addressing.

[Answer](201-1.ipynb#Answer-1-1-3)

#### Typing
Determine the type of elements in the current array:

In [54]:
M.dtype

dtype('int32')

Inserting a value that cannot be converted to the array's element type raises an error. Try:

In [55]:
M[0,0] = "hello"

ValueError: invalid literal for int() with base 10: 'hello'

Set the data type when creating an array, for example, complex numbers:

In [56]:
M = np.array([[1, 2, 3], [1, 4, 9]], dtype=complex)
M

array([[1.+0.j, 2.+0.j, 3.+0.j],
       [1.+0.j, 4.+0.j, 9.+0.j]])

We can change the element type of an existing array with `astype`. Converting from complex to real numbers discards the imaginary part (NumPy warns about this):

In [57]:
M = M.astype(float)
M


array([[1., 2., 3.],
       [1., 4., 9.]])

We can use data types: `int`, `float`, `complex`, `bool`, `object`.

Sizes in bits can be specified explicitly: `int64`, `int16`, `float128`, `complex128` (`float128` is not available on all platforms).
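
The bit size determines how much memory each element occupies. A minimal sketch using the attributes `itemsize` and `nbytes`:

```python
import numpy as np

small = np.array([1, 2, 3], dtype=np.int16)  # 2 bytes per element
big = np.array([1, 2, 3], dtype=np.int64)    # 8 bytes per element

print(small.itemsize, small.nbytes)
print(big.itemsize, big.nbytes)
```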

### Basic computational operations

The key to using interpreted languages efficiently is to make the most of vector operations and to avoid excessive use of loops. As many operations as possible should be expressed as operations between matrices and vectors, for example as vector or matrix multiplication.
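
As an illustration, a sum of squares written once with an explicit loop and once as a vector product. Both give the same result; the second form leaves the iteration to compiled code.

```python
import numpy as np

values = np.arange(1000)

# Explicit Python loop over the elements
total_loop = 0
for value in values:
    total_loop += int(value) ** 2

# The same computation as a dot product of the vector with itself
total_vec = int(np.dot(values, values))
print(total_loop, total_vec)
```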

#### Array operations with scalar

We use the usual arithmetic operations for multiplication, addition, and division with scalars.

In [58]:
v1 = np.arange(0, 5)

In [59]:
v1 * 2

array([0, 2, 4, 6, 8])

In [60]:
v1 + 2

array([2, 3, 4, 5, 6])

In [61]:
A * 2, A + 2

(array([[ 0,  2,  4,  6,  8],
        [20, 22, 24, 26, 28],
        [40, 42, 44, 46, 48],
        [60, 62, 64, 66, 68],
        [80, 82, 84, 86, 88]]),
 array([[ 2,  3,  4,  5,  6],
        [12, 13, 14, 15, 16],
        [22, 23, 24, 25, 26],
        [32, 33, 34, 35, 36],
        [42, 43, 44, 45, 46]]))

#### Array-array operations (element-wise)

Operations between multiple arrays are by default executed element-wise. For example, element-wise multiplication is achieved using the `*` operator.

In [62]:
A * A

array([[   0,    1,    4,    9,   16],
       [ 100,  121,  144,  169,  196],
       [ 400,  441,  484,  529,  576],
       [ 900,  961, 1024, 1089, 1156],
       [1600, 1681, 1764, 1849, 1936]])

In [63]:
v1 * v1

array([ 0,  1,  4,  9, 16])

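Attention: the array shapes must be compatible. Here `A` has shape `(5, 5)` and `v1` has shape `(5,)`, so `v1` is multiplied with each row of `A` (broadcasting).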

In [64]:
A.shape, v1.shape

((5, 5), (5,))

In [65]:
A * v1

array([[  0,   1,   4,   9,  16],
       [  0,  11,  24,  39,  56],
       [  0,  21,  44,  69,  96],
       [  0,  31,  64,  99, 136],
       [  0,  41,  84, 129, 176]])
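
The product above works because of broadcasting: the vector of shape `(5,)` is matched against the last dimension of `A` and reused for every row. A sketch of two related cases, scaling by rows instead of by columns, and an incompatible shape:

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

by_columns = A * v1              # column j is multiplied by v1[j]
by_rows = A * v1[:, np.newaxis]  # shape (5, 1): row i is multiplied by v1[i]

try:
    A * np.arange(0, 4)          # shapes (5, 5) and (4,) are incompatible
    failed = False
except ValueError:
    failed = True
print(by_rows[2], failed)
```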

### Iteration through array elements

We try to stick to the principle of avoiding loops over array elements. The reason is the slow execution of loops in interpreted languages such as Python.
Sometimes, however, we cannot avoid loops. A `for` loop is then a reasonable solution.

In [66]:
v = np.array([1,2,3,4])

for element in v:
    print(element)

1
2
3
4


In [67]:
M = np.array([[1,2], [3,4]])

for row in M:
    print("row", row)
    
    for element in row:
        print(element)

row [1 2]
1
2
row [3 4]
3
4


The built-in `enumerate` function is used when we need the index along with each element, for example to change values in place.

In [68]:
for i, row in enumerate(M):
    print("row index", i, "row", row)
    
    for j, element in enumerate(row):
        print("col index", j, "element", element)
       
        # Square each element
        M[i, j] = element ** 2

row index 0 row [1 2]
col index 0 element 1
col index 1 element 2
row index 1 row [3 4]
col index 0 element 3
col index 1 element 4


We get an array where each element is a square of the original value.

In [69]:
M

array([[ 1,  4],
       [ 9, 16]])
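
For comparison, the same result can be obtained without a loop, since the operator `**` also works element-wise:

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])
squared = M ** 2   # every element squared in a single operation
print(squared)
```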

### Example: Stockholm temperatures

We will apply the `numpy` library to daily temperature data from Stockholm. The data contain a measurement for each day between 1756 and 2025. They are stored in a file in which each line represents one measurement. The individual values (year, month, day and measured temperature) are separated by commas.

In [70]:
data = np.loadtxt('../data/stockholm.csv', delimiter=",", skiprows=1)

Check the data size: the number of lines (_measurements_, _samples_) and the number of columns (_attributes_).

In [71]:
data.shape

(98409, 4)

Columns store data in this order: `year`, `month`, `day` and `temperature`.

Let's take a look at all the measurements made in 2011. We create the Boolean vector `data[:, 0] == 2011`, which contains the value `True` at the relevant positions, and use it to address the data.

In [72]:
data[data[:, 0] == 2011]

array([[ 2.011e+03,  1.000e+00,  1.000e+00, -2.300e+00],
       [ 2.011e+03,  1.000e+00,  2.000e+00, -3.600e+00],
       [ 2.011e+03,  1.000e+00,  3.000e+00, -6.900e+00],
       ...,
       [ 2.011e+03,  1.200e+01,  2.900e+01,  4.900e+00],
       [ 2.011e+03,  1.200e+01,  3.000e+01,  6.000e-01],
       [ 2.011e+03,  1.200e+01,  3.100e+01, -2.600e+00]])

##### Question 1-1-4

Print out the temperature on a chosen date.

[Answer](201-1.ipynb#Answer-1-1-4)

#### Data Processing

Let's introduce operations that tell us something about the data. We will calculate some basic statistics.

##### Average, arithmetic mean

The daily temperature is in the column with index 3 (the fourth column). Calculate the average of all measurements.

In [73]:
np.mean(data[:,3])

6.26429086770519

We find that the average daily temperature in Stockholm over the whole recorded period was a pleasant 6.3 °C.

##### Question 1-1-5

What is the average temperature in January (month with the number 1)?

[Answer](202-1.ipynb#Answer-1-1-5)

#### Standard deviation and variance

In [74]:
np.std(data[:,3]), np.var(data[:,3])

(8.38204530528587, 70.25868349986487)
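
The variance is the square of the standard deviation (above, $8.382^2 \approx 70.26$). By default both functions divide by the number of values $n$; the parameter `ddof=1` switches to the divisor $n-1$. A sketch on a small made-up sample:

```python
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up values

std = np.std(sample)                   # divisor n
var = np.var(sample)                   # equal to std ** 2
var_unbiased = np.var(sample, ddof=1)  # divisor n - 1
print(std, var, var_unbiased)
```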

#### Minimum and maximum value

Check the year range in the data.

In [75]:
y = data[:, 0]
y_min = y.min()
y_max = y.max()
print("%i–%i" % (y_min, y_max))
print(int(y_max - y_min))

1756–2025
269


Let's find the lowest daily temperature:

In [76]:
data[:,3].min()

-27.7

Let's find the highest daily temperature:

In [77]:
data[:,3].max()

28.3

##### Question 1-1-6

In which month is the standard deviation of the temperature the largest?

[Answer](201-1.ipynb#Answer-1-1-6)

##### Question 1-1-7

Find the month and year in which the maximum temperature was recorded.

[Answer](201-1.ipynb#Answer-1-1-7)

#### Sum, product

Temperatures are not usually multiplied. Nevertheless, we take the opportunity to look at the functions for the sum and the product.

In [78]:
data[:, 3].sum() 

616462.6000000001

In [79]:
data[:, 3].sum() / data.shape[0] 

6.26429086770519

In [80]:
np.prod(data[0, :])

-15277.199999999999
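
Both functions also accept the parameter `axis`, which reduces along a single dimension instead of over all elements. A minimal sketch on a small matrix:

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])

total = M.sum()              # over all elements
column_sums = M.sum(axis=0)  # one value per column
row_sums = M.sum(axis=1)     # one value per row
product = M.prod()           # product of all elements
print(total, column_sums, row_sums, product)
```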

#### Global warming?

Rumors circulate in Stockholm that the temperature is increasing from year to year. Let's check if this is true.

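First we calculate the average temperature for each year. To do this, we use a condition as the address. The last year in the data is dropped with `[:-1]`, since its measurements end in June and its average would not be comparable.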

In [81]:
years = np.unique(data[:, 0])[:-1].astype(int)
yearly_avg = np.array([data[data[:, 0] == year, 3].mean() for year in years])

In [82]:
hottest_idx = np.argmax(yearly_avg)
coldest_idx = np.argmin(yearly_avg)
print("Hottest year:", int(years[hottest_idx]), "Avg temp:", yearly_avg[hottest_idx])
print("Coldest year:", int(years[coldest_idx]), "Avg temp:", yearly_avg[coldest_idx])

Hottest year: 2020 Avg temp: 9.75464480874317
Coldest year: 1867 Avg temp: 3.2309589041095896


##### Question 1-1-8

List the years in which the average temperature was higher than in the previous year.

[Answer](201-1.ipynb#Answer-1-1-8)

##### Question 1-1-9

Find the 10 warmest years.

[Answer](201-1.ipynb#Answer-1-1-9)

Let's also take a look at how the averages of longer time periods have changed over time.

In [83]:
avg_before_1900 = np.mean(yearly_avg[years < 1900])
avg_after_2000 = np.mean(yearly_avg[years >= 2000])
diff = avg_after_2000 - avg_before_1900

print(f"Average temp before 1900: {avg_before_1900:.2f}")
print(f"Average temp after 2000: {avg_after_2000:.2f}")
print(f"Change: {diff:.2f}°C")


Average temp before 1900: 5.75
Average temp after 2000: 8.20
Change: 2.45°C


In [84]:
intervals = np.arange(1750, 2051, 50)
for i in range(len(intervals) - 1):
    start = intervals[i]
    end = intervals[i + 1]
    mask = (data[:, 0] >= start) & (data[:, 0] < end)
    avg_temp = data[mask, 3].mean()
    print(f"Avg temp {start}–{end}: {avg_temp:.2f} °C")

Avg temp 1750–1800: 5.94 °C
Avg temp 1800–1850: 5.64 °C
Avg temp 1850–1900: 5.69 °C
Avg temp 1900–1950: 6.26 °C
Avg temp 1950–2000: 6.78 °C
Avg temp 2000–2050: 8.16 °C


With the `numpy` library, we can also find a polynomial of the desired degree that best fits our data and take a look into the future.

In [85]:
from numpy.polynomial import Polynomial as P
p = P.fit(years, yearly_avg, 3).convert()
print(f"Predicted average in 2100: {p(2100):.2f}°C")

Predicted average in 2100: 12.62°C
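
A sketch of what `fit` and `convert` do, on synthetic data that lie exactly on a line (the values are made up for illustration). `fit` internally rescales the x-values for numerical stability, and `convert()` re-expresses the coefficients in terms of the original, unscaled x-values.

```python
import numpy as np
from numpy.polynomial import Polynomial as P

x = np.arange(2000, 2010)
y = 3.0 + 0.5 * (x - 2000)    # synthetic, exactly linear data

p = P.fit(x, y, 1).convert()  # least-squares fit of degree 1
prediction = p(2020)          # evaluation outside the fitted range
print(p.coef, prediction)
```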
