# Data preparation

## The `numpy` library

The `numpy` library \cite{numpy} provides numerical computing in Python. It contains efficient implementations of data structures such as vectors, matrices, and arrays. All of these data structures derive from the array data type ($\texttt{array}$). Most computationally demanding operations are implemented in lower-level languages (Fortran, C). An array can be created in several ways:

* by converting Python lists or tuples,
* by using functions such as $\texttt{arange}$ and $\texttt{linspace}$,
* by reading data from files.

In [1]:
import numpy as np

### Converting lists into multi-dimensional arrays

The $\texttt{array}$ constructor can be called directly with a list.
Passing a list of numbers produces a vector:

In [2]:
v = np.array([1, 2, 3, 4])
v

array([1, 2, 3, 4])

Passing a list of lists produces a matrix:

In [3]:
M = np.array([[1, 2], [3, 4]])
M

array([[1, 2],
       [3, 4]])

Regardless of their shape, the objects $\texttt{v}$ and $\texttt{M}$ are both of type $\texttt{ndarray}$.

In [4]:
type(v), type(M)

(numpy.ndarray, numpy.ndarray)

The difference lies in their shapes. The object $\texttt{v}$ is a vector with four elements, while $\texttt{M}$ is a `2 x 2` matrix.

In [5]:
v.shape

(4,)

In [6]:
M.shape

(2, 2)

Similarly, the `size` attribute gives the total number of elements in the array.

In [7]:
M.size

4

##### Question 1-1-1

Arrays can have any number of dimensions. Try to create a list of lists of lists (of lists, ...) and check its dimensions!

In [8]:
# Build a structure with any number of dimensions and check its shape and size
# X = 

[Answer](201-1.ipynb#Answer-1-1-1)

### Differences between lists and arrays

A `numpy.ndarray` still looks like a list of lists (of lists, ...). So what is the difference?

Some quick facts:

* Python lists can hold objects of any type, and the types may differ within one list (dynamic typing). Lists do not support mathematical operations such as matrix multiplication, and implementing such operations on lists would be very inefficient because of dynamic typing.
* Arrays are **statically typed** and **homogeneous**. The data type of the elements is fixed when the array is created.
* As a result, arrays are memory-efficient, since they occupy a contiguous block of memory.

Let's check the element type of the current array:

In [9]:
M.dtype

dtype('int64')
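
The memory-efficiency point can be made concrete with a small sketch. The array `M_int` is a hypothetical example, not part of the notebook, and the 64-bit integer type is requested explicitly so that the sizes do not depend on the platform default.

```python
import numpy as np

# Hypothetical example array with an explicitly requested element type.
M_int = np.array([[1, 2], [3, 4]], dtype=np.int64)

print(M_int.dtype)                  # int64
print(M_int.itemsize)               # bytes per element: 8
print(M_int.nbytes)                 # total bytes of data: 4 * 8 = 32
print(M_int.flags["C_CONTIGUOUS"])  # True: the data sit in one contiguous block
```

Because every element has the same fixed size, the position of any element can be computed directly from its index.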

Inserting a value of an incompatible type into an array leads to problems. Try:

In [10]:
M[0,0] = "hello"

ValueError: invalid literal for int() with base 10: 'hello'
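
Not every mismatched assignment raises an error. The following sketch, using a hypothetical array `a`, suggests that a floating-point value assigned into an integer array is converted silently, which can be harder to notice than the `ValueError` above.

```python
import numpy as np

a = np.array([1, 2, 3])   # integer element type inferred from the input

a[0] = 2.9                # no error: the value is converted to an integer
print(a)                  # [2 2 3], the fractional part is discarded

# A non-numeric string cannot be converted, so this raises ValueError.
try:
    a[1] = "hello"
except ValueError as err:
    print("ValueError:", err)
```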

The data type can be set when the array is created, for example to complex numbers:

In [11]:
M = np.array([[1, 2, 3], [1, 4, 9]], dtype=complex)

M

array([[ 1.+0.j,  2.+0.j,  3.+0.j],
       [ 1.+0.j,  4.+0.j,  9.+0.j]])

An array with a different element type can be obtained during execution with `astype`:

In [12]:
M = M.astype(float)
M

array([[ 1.,  2.,  3.],
       [ 1.,  4.,  9.]])

The available data types include `int`, `float`, `complex`, `bool`, and `object`.

The size in bits can be given explicitly: `int64`, `int16`, `float128`, `complex128`. Note that `float128` is not available on every platform.

### Using arrays

First, let's look at how arrays are used.

#### Indexing

Elements are accessed with square brackets, as with lists.

In [13]:
# v is a vector; it is indexed along its only dimension
v[0]

1

In [14]:
# the matrix M is indexed with two values - the index is now a tuple
M[1,1]

4.0

Indexing a matrix with a single index returns a whole row.

In [15]:
M[1]

array([ 1.,  4.,  9.])

A `:` selects all elements along the corresponding dimension. How would access to an entire column be implemented with lists? It would take a `for` loop over the rows. The indexing syntax simplifies this substantially.

In [16]:
M[1, :]  # row

array([ 1.,  4.,  9.])

In [17]:
M[:, 1]  # column, quite simple

array([ 2.,  4.])
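
To illustrate the question about lists, here is a minimal comparison of column access on a nested list and on an array. The data in `rows` are a hypothetical example chosen only for this sketch.

```python
import numpy as np

rows = [[9.0, 2.0, 3.0],
        [1.0, 4.0, 9.0]]   # plain nested Python list (hypothetical data)

# With lists, a column has to be collected element by element.
first_column_list = [row[0] for row in rows]

# With an array, one indexing expression is enough.
first_column_array = np.array(rows)[:, 0]

print(first_column_list)    # [9.0, 1.0]
print(first_column_array)   # [9. 1.]
```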

Individual elements can be changed with assignment statements.

In [18]:
M[0, 0] = 9

In [19]:
M

array([[ 9.,  2.,  3.],
       [ 1.,  4.,  9.]])

Values can also be assigned along a whole dimension.

In [20]:
M[1, :] = 0
M[:, 2] = -1

In [21]:
M

array([[ 9.,  2., -1.],
       [ 0.,  0., -1.]])

#### Slicing

Slicing arrays is a common operation. An arbitrary sub-array is obtained with the index `M[from:to:step]`:

In [22]:
A = np.array([1, 2, 3, 4, 5])
A

array([1, 2, 3, 4, 5])

In [23]:
A[1:3]

array([2, 3])

Sliced sub-arrays can also be assigned to.

In [24]:
A[1:3] = [-2, -3]
A

array([ 1, -2, -3,  4,  5])
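
One consequence worth checking is that a basic slice is a view onto the original data rather than a copy. The sketch below uses hypothetical names (`view`, `independent`) to show the difference.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

view = a[1:3]      # basic slicing returns a view onto the same data
view[0] = -2       # ... so this also changes a[1]
print(a)           # [ 1 -2  3  4  5]

independent = a[1:3].copy()   # an explicit copy owns its own data
independent[:] = 0
print(a)           # still [ 1 -2  3  4  5]

print(np.shares_memory(a, view), np.shares_memory(a, independent))  # True False
```

Call `.copy()` whenever a slice must be modified without affecting the array it came from.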

Any of the slicing parameters may be omitted.

In [25]:
A[::]  # default values of from:to:step

array([ 1, -2, -3,  4,  5])

In [26]:
A[::2]  # step of size 2

array([ 1, -3,  5])

In [27]:
A[:3]  # first three elements

array([ 1, -2, -3])

In [28]:
A[3:]  # elements from index 3 onward

array([4, 5])

Negative indices count from the <i>end</i> of the array:

In [29]:
A = np.array([1, 2, 3, 4, 5])

In [30]:
A[-1]

5

The last three elements:

In [31]:
A[-3:]

array([3, 4, 5])

Slicing also works on multi-dimensional arrays.

In [32]:
A = np.array([[n+m*10 for n in range(5)] for m in range(5)])
A

array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

In [33]:
# a sub-array of the original array A
A[1:4, 1:4]

array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

Elements can be skipped with a step.

In [34]:
A[::2, ::2]

array([[ 0,  2,  4],
       [20, 22, 24],
       [40, 42, 44]])

#### Indexing arrays with another structure

Arrays can also be indexed with other arrays or lists.

In [35]:
row_indices = [1, 2, 3]
A[row_indices]

array([[10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34]])

In [36]:
col_indices = [1, 2, -1]
A[row_indices, col_indices]

array([11, 22, 34])
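
Passing two index lists selects paired elements, not a rectangular block. If the block is what is wanted, one possible approach is `np.ix_`, sketched below with the same `A`, `row_indices`, and `col_indices`. The names `paired` and `block` are introduced only for this illustration.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
row_indices = [1, 2, 3]
col_indices = [1, 2, -1]

# Paired indexing: elements (1, 1), (2, 2) and (3, -1).
paired = A[row_indices, col_indices]
print(paired)   # [11 22 34]

# np.ix_ builds an open mesh, selecting every listed row/column combination.
block = A[np.ix_(row_indices, col_indices)]
print(block)
# [[11 12 14]
#  [21 22 24]
#  [31 32 34]]
```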

We can also use *masks*. A mask is a structure of `bool` values indicating whether the element at the corresponding position is selected.

In [37]:
B = np.array([n for n in range(5)])
B

array([0, 1, 2, 3, 4])

In [38]:
row_mask = np.array([True, False, True, False, False])
B[row_mask]

array([0, 2])

A slightly different way of defining the mask:

In [39]:
row_mask = np.array([1, 0, 1, 0, 0], dtype=bool)
B[row_mask]

array([0, 2])

Masks can be used to select elements conditionally, based on their values.

In [40]:
x = np.array([0, 4, 2, 2, 3, 7, 10, 12, 15, 28])
x

array([ 0,  4,  2,  2,  3,  7, 10, 12, 15, 28])

In [41]:
mask = (5 < x) & (x < 12.3)  # element-wise logical AND
mask

array([False, False, False, False, False,  True,  True,  True, False, False], dtype=bool)

In [42]:
x[mask]

array([ 7, 10, 12])
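
Boolean masks can be combined with the element-wise operators `&` (and), `|` (or), and `~` (not). The sketch below reuses the values of `x`; the names `inside`, `outside`, and `extreme` are hypothetical.

```python
import numpy as np

x = np.array([0, 4, 2, 2, 3, 7, 10, 12, 15, 28])

inside = (x > 5) & (x < 12.3)   # element-wise AND; the parentheses are required
outside = ~inside               # element-wise NOT
extreme = (x < 1) | (x > 20)    # element-wise OR

print(x[inside])                # [ 7 10 12]
print(x[outside])               # [ 0  4  2  2  3 15 28]
print(x[extreme])               # [ 0 28]
print(np.flatnonzero(inside))   # positions of the selected elements: [5 6 7]
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.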

##### Question 1-1-2

Try combining all the indexing methods mentioned so far. For example, index the rows with slicing and the columns with a condition at the same time. Create structures with more than two dimensions. Make sure you understand the result of each indexing expression.

In [43]:
# Try several indexing methods at once.
A[A[:, 0]>10, 0:2 ]
# ...
# ...

array([[20, 21],
       [30, 31],
       [40, 41]])

[Answer](201-1.ipynb#Answer-1-1-2)

#### Functions for creating arrays

The `numpy` library contains functions for creating common kinds of arrays. Let's look at some examples.

**The `arange` range**

In [44]:
np.arange(0, 10, 1)  # from, to, step

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [45]:
np.arange(-1, 1, 0.1)

array([ -1.00000000e+00,  -9.00000000e-01,  -8.00000000e-01,
        -7.00000000e-01,  -6.00000000e-01,  -5.00000000e-01,
        -4.00000000e-01,  -3.00000000e-01,  -2.00000000e-01,
        -1.00000000e-01,  -2.22044605e-16,   1.00000000e-01,
         2.00000000e-01,   3.00000000e-01,   4.00000000e-01,
         5.00000000e-01,   6.00000000e-01,   7.00000000e-01,
         8.00000000e-01,   9.00000000e-01])
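
The value `-2.22044605e-16` in the output above, where an exact zero might be expected, is the result of accumulated floating-point rounding with a non-integer step. As a hedged alternative, `linspace` with `endpoint=False` takes the number of points instead of the step. The names `a` and `b` are used only in this sketch.

```python
import numpy as np

a = np.arange(-1, 1, 0.1)                    # step given, end point excluded
b = np.linspace(-1, 1, 20, endpoint=False)   # number of points given

print(len(a), len(b))      # 20 20 for the call shown above
print(np.allclose(a, b))   # True: the two arrays agree up to rounding error
```

When the number of points matters more than the exact step, `linspace` is usually the more predictable choice.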

**The `linspace` and `logspace` ranges**

Note: both the start and the end point are included.

In [46]:
np.linspace(0, 10, 25)  # from, to, number of evenly spaced points

array([  0.        ,   0.41666667,   0.83333333,   1.25      ,
         1.66666667,   2.08333333,   2.5       ,   2.91666667,
         3.33333333,   3.75      ,   4.16666667,   4.58333333,
         5.        ,   5.41666667,   5.83333333,   6.25      ,
         6.66666667,   7.08333333,   7.5       ,   7.91666667,
         8.33333333,   8.75      ,   9.16666667,   9.58333333,  10.        ])

In [47]:
np.logspace(0, 10, 11, base=np.e)  # try another base: 2, 3, 10

array([  1.00000000e+00,   2.71828183e+00,   7.38905610e+00,
         2.00855369e+01,   5.45981500e+01,   1.48413159e+02,
         4.03428793e+02,   1.09663316e+03,   2.98095799e+03,
         8.10308393e+03,   2.20264658e+04])
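
A short check of how `logspace` relates to `linspace`: the first two arguments are exponents, and the result is the base raised to evenly spaced exponents. The names `powers` and `manual` are hypothetical.

```python
import numpy as np

powers = np.logspace(0, 10, 11, base=np.e)
manual = np.e ** np.linspace(0, 10, 11)   # same exponents, raised explicitly

print(np.allclose(powers, manual))        # True
print(np.logspace(0, 3, 4, base=2))       # [1. 2. 4. 8.]
```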

**Random arrays, the `numpy.random` module**

In [48]:
from numpy import random
random.seed(42)  # makes the random results reproducible

Uniformly distributed values in the half-open interval [0, 1):

In [49]:
random.rand(5, 5)

array([[ 0.37454012,  0.95071431,  0.73199394,  0.59865848,  0.15601864],
       [ 0.15599452,  0.05808361,  0.86617615,  0.60111501,  0.70807258],
       [ 0.02058449,  0.96990985,  0.83244264,  0.21233911,  0.18182497],
       [ 0.18340451,  0.30424224,  0.52475643,  0.43194502,  0.29122914],
       [ 0.61185289,  0.13949386,  0.29214465,  0.36636184,  0.45606998]])

Normally distributed values with mean 0 and standard deviation 1:

In [50]:
random.randn(5, 5)

array([[-0.62947496,  0.59772047,  2.55948803,  0.39423302,  0.12221917],
       [-0.51543566, -0.60025385,  0.94743982,  0.291034  , -0.63555974],
       [-1.02155219, -0.16175539, -0.5336488 , -0.00552786, -0.22945045],
       [ 0.38934891, -1.26511911,  1.09199226,  2.77831304,  1.19363972],
       [ 0.21863832,  0.88176104, -1.00908534, -1.58329421,  0.77370042]])
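
Newer NumPy versions (1.17 and later) also provide a generator-based interface, `np.random.default_rng`. The sketch below is an alternative to the `seed`/`rand`/`randn` calls used in this notebook, not a replacement the notebook itself uses, and it produces different numbers than the outputs shown above.

```python
import numpy as np

rng = np.random.default_rng(42)        # independent generator with its own seed

uniform = rng.random((5, 5))           # uniform on [0, 1)
normal = rng.standard_normal((5, 5))   # mean 0, standard deviation 1

# Re-creating the generator with the same seed reproduces the same numbers.
again = np.random.default_rng(42).random((5, 5))
print(np.array_equal(uniform, again))  # True
```

Note that the shape is passed as one tuple here, unlike `rand(5, 5)`.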

**The diagonal matrix `diag`**

The diagonal contains the values 1, 2, and 3.

In [51]:
np.diag([1, 2, 3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

The diagonal can be offset from the main diagonal by `k` places. Note that the dimension of the matrix grows accordingly.

In [52]:
np.diag([1, 2, 3], k=1) 

array([[0, 1, 0, 0],
       [0, 0, 2, 0],
       [0, 0, 0, 3],
       [0, 0, 0, 0]])
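
`np.diag` works in both directions: given a 1-D sequence it builds a matrix, and given a 2-D array it extracts a diagonal. A brief sketch, with hypothetical names `D` and `L`:

```python
import numpy as np

D = np.diag([1, 2, 3])
print(np.diag(D))               # [1 2 3]: a 2-D input returns its diagonal

L = np.diag([1, 2, 3], k=-1)    # negative k places the values below the main diagonal
print(L.shape)                  # (4, 4)
print(np.diag(L, k=-1))         # [1 2 3]

print(np.eye(3, dtype=int))     # identity matrix, ones on the main diagonal
```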

**Zeros and ones - `zeros`, `ones`**

In [53]:
np.zeros((3, 3))

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [54]:
np.ones((3, 3))

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

### Basic computational operations

The key to using interpreted languages efficiently is to make the most of vector operations. Avoid unnecessary loops. Implement as many operations as possible as operations between matrices and vectors, for example as vector or matrix multiplication.

#### Array operations with a scalar

The usual arithmetic operators are used to multiply, add, and divide by a scalar.

In [55]:
v1 = np.arange(0, 5)

In [56]:
v1 * 2

array([0, 2, 4, 6, 8])

In [57]:
v1 + 2

array([2, 3, 4, 5, 6])

In [58]:
A * 2, A + 2

(array([[ 0,  2,  4,  6,  8],
        [20, 22, 24, 26, 28],
        [40, 42, 44, 46, 48],
        [60, 62, 64, 66, 68],
        [80, 82, 84, 86, 88]]), array([[ 2,  3,  4,  5,  6],
        [12, 13, 14, 15, 16],
        [22, 23, 24, 25, 26],
        [32, 33, 34, 35, 36],
        [42, 43, 44, 45, 46]]))

#### Array-array operations (element-wise)

Operations between arrays are applied element-wise by default. For example, element-wise multiplication uses the `*` operator.

In [59]:
A * A

array([[   0,    1,    4,    9,   16],
       [ 100,  121,  144,  169,  196],
       [ 400,  441,  484,  529,  576],
       [ 900,  961, 1024, 1089, 1156],
       [1600, 1681, 1764, 1849, 1936]])

In [60]:
v1 * v1

array([ 0,  1,  4,  9, 16])
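
Since `*` is element-wise, matrix multiplication needs a different operator. The sketch below contrasts `*` with the `@` operator on small hypothetical arrays `P` and `u`.

```python
import numpy as np

P = np.array([[1, 2], [3, 4]])
u = np.array([1, 2])

print(P * P)   # element-wise product: [[1, 4], [9, 16]]
print(P @ P)   # matrix product:       [[7, 10], [15, 22]]
print(P @ u)   # matrix-vector product: [5, 11]
print(u @ u)   # inner product: 5
```

`np.dot(P, u)` gives the same result as `P @ u` for these 1-D and 2-D inputs.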

Note that the shapes of the arrays must be compatible. Here `A` has shape `(5, 5)` and `v1` has shape `(5,)`, so `v1` is broadcast across every row of `A`.

In [61]:
A.shape, v1.shape

((5, 5), (5,))

In [62]:
A * v1

array([[  0,   1,   4,   9,  16],
       [  0,  11,  24,  39,  56],
       [  0,  21,  44,  69,  96],
       [  0,  31,  64,  99, 136],
       [  0,  41,  84, 129, 176]])
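
The product above works because NumPy broadcasts the shorter shape against the longer one. The following sketch, with hypothetical names `by_columns` and `by_rows`, shows how to scale rows instead of columns and what happens when the shapes are incompatible.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

by_columns = A * v1               # v1 is treated as a row: shape (5,) -> (1, 5)
by_rows = A * v1[:, np.newaxis]   # shape (5, 1): row i is scaled by v1[i]

print(np.array_equal(by_columns, A * v1[np.newaxis, :]))   # True
print(by_rows[1])                 # [10 11 12 13 14], row 1 times 1
print(by_rows[2])                 # [40 42 44 46 48], row 2 times 2

# Shapes that cannot be broadcast raise an error.
try:
    A * np.arange(4)              # (5, 5) with (4,)
except ValueError as err:
    print("ValueError:", err)
```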

### Iterating over array elements

As a rule, we try to avoid looping over array elements, because loops are slow in interpreted languages such as Python.
Sometimes, however, loops cannot be avoided. In that case a `for` loop is a reasonable solution.

In [63]:
v = np.array([1,2,3,4])

for element in v:
    print(element)

1
2
3
4


In [64]:
M = np.array([[1,2], [3,4]])

for row in M:
    print("row", row)
    
    for element in row:
        print(element)

row [1 2]
1
2
row [3 4]
3
4


The built-in `enumerate` is useful when we want to iterate over the elements and possibly change their values, since it also yields each index.

In [65]:
for i, row in enumerate(M):
    print("row index", i, "row", row)
    
    for j, element in enumerate(row):
        print("col index", j, "element", element)
       
        # Square each element
        M[i, j] = element ** 2

row index 0 row [1 2]
col index 0 element 1
col index 1 element 2
row index 1 row [3 4]
col index 0 element 3
col index 1 element 4


The result is an array in which each element is the square of its original value.

In [66]:
M

array([[ 1,  4],
       [ 9, 16]])
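
For this particular task the loop is not needed. As a sketch of the vectorized alternative, the same squares can be computed with a single expression; the names `looped` and `vectorized` are hypothetical.

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])

# Loop version, writing into a copy so that M itself stays unchanged.
looped = M.copy()
for i, row in enumerate(M):
    for j, element in enumerate(row):
        looped[i, j] = element ** 2

vectorized = M ** 2   # one expression, no Python-level loop

print(np.array_equal(looped, vectorized))   # True
```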

Learn more about the `numpy` library in \cite{numpy, numpyweb, numpytut, numpymatlab}.