# 2. Arrays - Part 2

In [1]:
import numpy

## Structured arrays

A structured array consists of a number of columns, where each column can have a different data type.

Full information about structured arrays: 
http://docs.scipy.org/doc/numpy-1.10.1/user/basics.rec.html#structured-arrays

One way to specify a structured array is to pass a list of tuples as `dtype`.
Each tuple describes one column of the array: it gives the name of the column and the type of data stored in it. For example:

In [2]:
dtype = [('Name', 'U10'), ('Country', 'U10'), ('Area', 'float64')]

The content of the array can then be given as a list of tuples, like so:

In [3]:
city = numpy.array([('Amsterdam', 'Netherlands', 219.3),
                    ('Paris',     'France',      105.4 ),
                    ('Barcelona', 'Spain',       101.9 )],
                     dtype=dtype)
print(city)

[('Amsterdam', 'Netherland', 219.3) ('Paris', 'France', 105.4)
 ('Barcelona', 'Spain', 101.9)]
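
Note that the output shows `'Netherland'` rather than `'Netherlands'`: a `'U10'` column stores at most 10 characters, and longer strings are cut off without a warning. A minimal sketch of this behaviour is given below. The variable names and the wider `'U11'` type are illustrative choices, not part of the original example.

```python
import numpy

# Same column layout as above: both text columns hold at most 10 characters.
short_dtype = [('Name', 'U10'), ('Country', 'U10'), ('Area', 'float64')]
short = numpy.array([('Amsterdam', 'Netherlands', 219.3)], dtype=short_dtype)
print(short['Country'][0])   # the 11th character is silently dropped

# Illustrative alternative: 'U11' leaves room for all 11 characters.
wide_dtype = [('Name', 'U10'), ('Country', 'U11'), ('Area', 'float64')]
wide = numpy.array([('Amsterdam', 'Netherlands', 219.3)], dtype=wide_dtype)
print(wide['Country'][0])
```

When choosing string widths, it is probably safest to size them for the longest value you expect.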


#### Question
What happens when you use a list of lists instead of a list of tuples to build a structured array?

### Indexing structured arrays
The rows in a structured array can be accessed by regular indexing. The columns can be accessed by using the column names that were specified when the array was created.

In [4]:
# Access first row
print(city[0])

# Access first two rows
print(city[0:2])

# Access column by name
print(city['Area'])

# Access two columns using list of names
print(city[['Name', 'Area']])

# Print information about the array
print(city.shape, city.dtype)

('Amsterdam', 'Netherland', 219.3)
[('Amsterdam', 'Netherland', 219.3) ('Paris', 'France', 105.4)]
[ 219.3  105.4  101.9]
[('Amsterdam', 219.3) ('Paris', 105.4) ('Barcelona', 101.9)]
(3,) [('Name', '<U10'), ('Country', '<U10'), ('Area', '<f8')]


Note that this structured array, even though it has rows and columns, 
is treated as one-dimensional.
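
The following sketch illustrates what one-dimensional means here: the array has one axis (the rows), and the columns live in the `dtype` rather than in a second axis. The array is rebuilt under the hypothetical name `cities` so that the example is self-contained.

```python
import numpy

dtype = [('Name', 'U10'), ('Country', 'U10'), ('Area', 'float64')]
cities = numpy.array([('Amsterdam', 'Netherlands', 219.3),
                      ('Paris',     'France',      105.4),
                      ('Barcelona', 'Spain',       101.9)],
                     dtype=dtype)

print(cities.ndim)               # number of axes
print(cities.shape)              # one entry per row
print(len(cities.dtype.names))   # number of columns (fields)

# A single value is reached by combining a row index and a column name,
# in either order.
print(cities[1]['Area'], cities['Area'][1])
```

Because of this, an index such as `cities[0, 2]` is not valid for a structured array; the column has to be selected by name.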

### Accessing and modifying column names

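The column names are stored in `city.dtype.names`. Assigning a new tuple of names to this attribute renames the columns. For example: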


In [5]:
city.dtype.names

('Name', 'Country', 'Area')

In [6]:
city.dtype.names = ('name', 'country', 'area')
print(city['area'])

[ 219.3  105.4  101.9]


### Loading data into structured arrays

Structured arrays are useful for loading and working with tabular data with heterogeneous column types. 

#### Exercise 2b.1

Complete the following code to load the data from the file [populations.txt](populations.txt). Load the year column as an `int`, and the other columns as `float`.

In [None]:
dtype = [('year',  ...
         ('hare',  ...
         ...
          ] 
population = numpy.loadtxt("populations.txt", dtype=...)

An alternative way of loading tabular data is to use `genfromtxt`:

In [7]:
population = numpy.genfromtxt("populations.txt", 
             names=True,
             dtype=['int','float','float','float'])

# Access the lynx column
print(population['lynx'])

[  4000.   6100.   9800.  35200.  59400.  41700.  19000.  13000.   8300.
   9100.   7400.   8000.  12300.  19500.  45700.  51100.  29700.  15800.
   9700.  10100.   8600.]
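
To experiment with `genfromtxt` without the data file, the text can be supplied from memory instead. The sketch below assumes a layout similar to `populations.txt` (a header line followed by whitespace-separated values). The rows are made up for illustration and are not read from the actual file.

```python
import io
import numpy

# Made-up sample in the assumed layout: header line, then one row per year.
text = """year hare lynx carrot
1900 30000 4000 48300
1901 47200 6100 48200
1902 70200 9800 41500
"""

# names=True takes the column names from the first line;
# dtype gives one type per column, in order.
sample = numpy.genfromtxt(io.StringIO(text),
                          names=True,
                          dtype=(int, float, float, float))

print(sample.dtype.names)
print(sample['year'])
print(sample['lynx'])
```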


### Record arrays
There is a special interface to structured arrays called **record arrays**. For details, see https://docs.scipy.org/doc/numpy-1.10.1/user/basics.rec.html#record-arrays

## Array Indexing

For complete information about indexing, see
http://docs.scipy.org/doc/numpy/user/basics.indexing.html

You have already seen how to access the content of an array by using an index for each dimension. This method is known as matrix indexing. In addition to matrix indexing, there are other ways to address content in an array:

- Linear indexing treats the n-dimensional array as a 1-dimensional list of its elements. This linear index is returned when the `argmin` and `argmax` functions are applied to an n-dimensional array.

In [8]:
a = numpy.random.uniform(-1,1,(5,5))
print(a)
# Return the index of the maximum value
numpy.argmax(a)

[[ 0.29493108 -0.70327901  0.97174343 -0.62764773  0.22290627]
 [-0.35018568  0.93124911 -0.47113303  0.37153914  0.37635051]
 [-0.76357905 -0.6857155   0.16636888 -0.26510055  0.94908343]
 [-0.39150909  0.90375192 -0.49675443  0.62174875  0.52339078]
 [ 0.83510306  0.40782966  0.66064087 -0.90445933  0.7370534 ]]


2

- Boolean indexing, which returns all values in the array for which the index is True.

In [9]:
# Create a boolean index for positive numbers in array a
index = a > 0.0
print(index)
# Return all the positive numbers
print(a[index])

[[ True False  True False  True]
 [False  True False  True  True]
 [False False  True False  True]
 [False  True False  True  True]
 [ True  True  True False  True]]
[ 0.29493108  0.97174343  0.22290627  0.93124911  0.37153914  0.37635051
  0.16636888  0.94908343  0.90375192  0.62174875  0.52339078  0.83510306
  0.40782966  0.66064087  0.7370534 ]


- Indexing with an array of indices. In this case you store the indices as integers in a separate array, and you get back exactly the elements of the array at these indices.

In [10]:
b = numpy.linspace(0,1,10)
print(b)
# Return numbers at prime indices
index = numpy.array([ 2, 3, 5, 7])
print(b[index])

[ 0.          0.11111111  0.22222222  0.33333333  0.44444444  0.55555556
  0.66666667  0.77777778  0.88888889  1.        ]
[ 0.22222222  0.33333333  0.55555556  0.77777778]


### Linear and matrix indexing

Indexing in a 1-dimensional array works the same way as indexing in a Python list. If you want to apply something to every element of the array, a simple for-loop over the items does the trick.

Indexing in an n-dimensional array uses one index per dimension. To access a single element of the array, give an index for every dimension. To access more than one element, use slicing with `":"`. This works as it does for lists, except that you can use `":"` in every dimension. If no index is given for a trailing dimension, `":"` is assumed for it.
If the index is `[a:b]`, then the indices used run from `a` up to but not including `b`.

If you have the linear index and you want to convert it to a matrix index, you can use the function `numpy.unravel_index()`.

The first argument is the linear index and the second argument is the shape of the array for which you want to transform the index. For example: `numpy.unravel_index(linear_index, (2,3))`. 

In [11]:
# indexing in a 3-dimensional array
z = numpy.arange(24).reshape((2, 3, 4))
print(z)

# slices
print(z[0:2, 1:3, 3])
print(z[:, 2, :])

# linear indexing
linear_index = 10
print("\n For a matrix with dimensions (2, 3, 4), the linear index: ", linear_index, " is equal to \
matrix index: ", numpy.unravel_index(linear_index, z.shape))


[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]]
[[ 7 11]
 [19 23]]
[[ 8  9 10 11]
 [20 21 22 23]]

 For a matrix with dimensions (2, 3, 4), the linear index:  10  is equal to matrix index:  (0, 2, 2)
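
As a cross-check of the conversion above, the sketch below confirms that the linear index and the matrix index address the same element. It also uses `numpy.ravel_multi_index`, which is not introduced in this section, to convert in the opposite direction.

```python
import numpy

z = numpy.arange(24).reshape((2, 3, 4))
linear_index = 10

# Linear index -> matrix index (one entry per dimension)
matrix_index = numpy.unravel_index(linear_index, z.shape)
print(matrix_index)

# Both ways of indexing return the same element
print(z[matrix_index], z.ravel()[linear_index])

# Matrix index -> linear index
print(numpy.ravel_multi_index(matrix_index, z.shape))
```

The linear index counts elements row by row, so the last dimension varies fastest: here `10 = 0*12 + 2*4 + 2`.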


#### Exercise 2b.2

Create a $4\times3$ matrix of random numbers between $0$ and $1$. 
Find the row and column position of the minimum and the maximum value.

#### Exercise 2b.3 

Complete the following code to print the years with the smallest number of hares, lynxes and carrots in the
populations dataset.

In [None]:
for species in [....]:
    year = ...
    print("Least # of {} in year {}".format(species, year))

### Boolean indexing

A boolean index can be created directly, but most often it is built by specifying a condition.

The condition is evaluated for every position in the array and yields True or False. Only the elements at positions where the condition is True are retrieved.

In [12]:
# Boolean indexing
x = numpy.arange(1, 6)
y = numpy.array([True, False, True, False, True ])
print("Only elements of x for which the value in y is True: ", x[y])

# boolean indexing by using a condition
print("Only elements of x for which the condition is True: ", x[x>3])

Only elements of x for which the value in y is True:  [1 3 5]
Only elements of x for which the condition is True:  [4 5]
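
Several conditions can be combined into one boolean index. A minimal sketch, assuming the same array `x` as above: the element-wise operators `&` (and) and `|` (or) are used, and each condition is wrapped in parentheses. The names `inside` and `outside` are chosen for this illustration only.

```python
import numpy

x = numpy.arange(1, 6)

# Elements strictly between 1 and 5
inside = (x > 1) & (x < 5)
print(x[inside])

# Elements at or beyond either end
outside = (x <= 1) | (x >= 5)
print(x[outside])

# True counts as 1, so summing a boolean index counts the matches
print(inside.sum())
```

The parentheses matter because `&` and `|` bind more tightly than the comparison operators.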


#### Exercise 2b.4
Use the population data to

1. Select all the years in which there are more than 50000 lynxes;
2. Select all the years in which there are more lynxes than hares.

### Indexing with an array of indices

In this case you store the indices as integers in a separate array, and you get back exactly the elements of the array at these indices.

One advantage of this is that you can explicitly specify the order in which you want the values, and you can return the value at a given position more than once.

In [13]:
x = numpy.arange(100, 111)
y = numpy.array([8, 3, 8, 4, 9, 3])
print("Array x: ", x)
print("Array with indices: ", y)
print("Indexing with an array of indices will give:", x[y])

Array x:  [100 101 102 103 104 105 106 107 108 109 110]
Array with indices:  [8 3 8 4 9 3]
Indexing with an array of indices will give: [108 103 108 104 109 103]
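
One common source of such an index array is `numpy.argsort`, which returns the indices that would sort an array. The sketch below is an illustration with two hypothetical arrays, `names` and `areas`, holding the city data from earlier in this section. The same index reorders both arrays consistently.

```python
import numpy

names = numpy.array(['Amsterdam', 'Paris', 'Barcelona'])
areas = numpy.array([219.3, 105.4, 101.9])

# Indices that put the areas in ascending order
order = numpy.argsort(areas)
print(order)

# Apply the same index array to both arrays
print(names[order])
print(areas[order])
```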


#### Exercise 2b.5

Indexing with an array is often useful when we want to randomize the order of items in some data. Complete the following code, which creates a scrambled version of the population data:

In [None]:
# Create an index for the rows of population (from 0 to population.shape[0])
index = ...
# Shuffle the index
numpy.random.shuffle(index)
# Create a scrambled version
population_rand = ...

## Vector stacking

Sometimes you want to combine two or more vectors to create an array. This is called vector stacking. Vector stacking can be done in two different ways: horizontally and vertically.
- horizontal stack: `numpy.hstack([x, y, z])`
- vertical stack: `numpy.vstack([x, y, z])`

In [14]:
x = numpy.arange(0,5)                     
y = numpy.arange(5, 10)   
z = numpy.arange(10, 15)
print("Horizontal stack: ",  numpy.hstack([x,y, z]) )
print("Vertical stack: ")
print( numpy.vstack([x,y, z]))

Horizontal stack:  [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
Vertical stack: 
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
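
The effect of the two functions is perhaps easiest to see from the shapes of the results. The sketch below compares them for 1-dimensional vectors and, as an extra illustration not covered above, for two small 2-dimensional arrays `a` and `b`.

```python
import numpy

x = numpy.arange(0, 5)
y = numpy.arange(5, 10)

# 1-dimensional input: hstack gives one longer vector, vstack gives rows
print(numpy.hstack([x, y]).shape)   # (10,)
print(numpy.vstack([x, y]).shape)   # (2, 5)

# 2-dimensional input: hstack adds columns, vstack adds rows
a = numpy.zeros((2, 3))
b = numpy.ones((2, 3))
print(numpy.hstack([a, b]).shape)   # (2, 6)
print(numpy.vstack([a, b]).shape)   # (4, 3)
```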


### Save data set to file

To save a numpy array to a file, you always specify the filename and the array you want to save. You can use the following functions:
- `numpy.savetxt(filename, array)`: save an array to a text file. Some optional arguments are `delimiter=' '`, `newline='\n'` and `header=''`. http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.savetxt.html#numpy.savetxt
- `numpy.save(filename, array)`: save an array to a binary file in numpy `.npy` format. http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.save.html#numpy.save
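
A minimal sketch of a text round trip with `numpy.savetxt` and `numpy.loadtxt` is given below. The file name `example.txt`, the temporary directory and the column labels in the header are assumptions made for this illustration.

```python
import os
import tempfile
import numpy

data = numpy.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name inside a throwaway directory
    path = os.path.join(tmp, "example.txt")

    # Write comma-separated values with a header line
    numpy.savetxt(path, data, delimiter=",", header="a,b")

    # The header is written as a comment line, prefixed with "# "
    with open(path) as f:
        first_line = f.readline().strip()

    # loadtxt skips comment lines and reads the values back
    restored = numpy.loadtxt(path, delimiter=",")

print(first_line)
print(restored)
```

A text file is easy to inspect by hand, at the cost of being larger and slower to read than the binary `.npy` format.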


#### Exercise 2b.6 

Save the population data to a `.npy` file. Figure out how to load it back into a numpy array.

#### Exercise 2b.7
The files

- [irisa.txt](irisa.txt)
- [irisb.txt](irisb.txt)
- [irisc.txt](irisc.txt)

contain the data for the iris dataset. Each file has these columns:

- `SepalLength` 
- `SepalWidth`
- `PetalLength` 
- `PetalWidth` 
- `Species`

Load this data, and create a single array with all the species.