# Querying, Slicing, and Combining Arrays

One of the most important things that you will need to do as a data scientist is slice and combine data in different ways. In the next couple of weeks we will become acquainted with a very powerful tool to help us do that: pandas. However, in the meantime, let's see some more advanced ways to manipulate NumPy arrays.

In [1]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)

3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
1.9.2


In [2]:
np.random.seed(10)

First we will create some arrays to work with: a one-dimensional array of consecutive integers and a two-dimensional array of random integers.

In [3]:
ar = np.arange(12)
ar

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [4]:
ar2 = np.random.random_integers(12, size=12).reshape((3,4))
ar2

array([[10,  5,  1,  2],
       [12, 10,  1,  2],
       [11,  9, 10,  1]])

## Querying
Just as with Python lists, we will often need to pull specific values out of an array. This is quite straightforward with a one-dimensional array.

In [5]:
ar[5]

5

In [6]:
ar[5:]

array([ 5,  6,  7,  8,  9, 10, 11])

In [7]:
ar[2:8:2]

array([2, 4, 6])

In [8]:
ar[:-1]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [9]:
ar[-1:-8:-2]

array([11,  9,  7,  5])

Querying two dimensions gets a bit more interesting because we have another dimension to think about. We use the same slicing syntax as for a one-dimensional array, except that we write one slice per dimension, separated by a comma: rows first, then columns.

In [10]:
ar2

array([[10,  5,  1,  2],
       [12, 10,  1,  2],
       [11,  9, 10,  1]])

In [11]:
ar2[:,1:3]

array([[ 5,  1],
       [10,  1],
       [ 9, 10]])

With the above example, we can see that we have the second and third columns. The `:` before the comma specifies that we want all rows, while the `1:3` after the comma specifies that we want columns 1 and 2 (the stop index is excluded).

Take a guess at how we might get the last two rows along with all columns.

In [12]:
ar2[1:3,:]

array([[12, 10,  1,  2],
       [11,  9, 10,  1]])

We applied the same kind of slice as in the first query, but swapped the positions of the row and column slices. We can also use step increments, just as before. For example, we can take every second row and every second column of the array.

In [13]:
ar2[::2,::2]

array([[10,  1],
       [11, 10]])

Although we will not be delving too deeply into further dimensions, we can apply the same methodology to higher dimensions. If we have a three-dimensional cube, we simply add another comma and write a slice for the third axis.
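As a minimal sketch of that idea, the cell below builds a small three-dimensional array and slices it with one slice per axis. The name `cube` and its shape are made up for illustration and are not part of the arrays used elsewhere in this notebook.

```python
import numpy as np

# A hypothetical 2 x 3 x 4 "cube", built only for illustration.
cube = np.arange(24).reshape((2, 3, 4))

# One slice per axis, separated by commas:
# all blocks, rows 1 onward, every second column.
sub = cube[:, 1:, ::2]

print(sub.shape)  # (2, 2, 2)
print(sub)
```

The pattern is the same as in two dimensions; the only change is the extra comma and slice for the third axis.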

## Combining Arrays

Now that we have covered the basics of querying arrays, let's see how to combine them. There are many ways to combine arrays. NumPy does it in a specific way, while other libraries have their own peculiarities. Let's first explore concatenating or stacking.

We will first proceed with a vertical stack.

In [14]:
ar

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [15]:
ar2.flatten()

array([10,  5,  1,  2, 12, 10,  1,  2, 11,  9, 10,  1])

In [16]:
np.vstack((ar,ar2.flatten()))

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [10,  5,  1,  2, 12, 10,  1,  2, 11,  9, 10,  1]])

You can see above that we have just concatenated them vertically, which is quite straightforward. Keep in mind that our array `ar2` is not actually modified; `flatten` simply returned a flattened copy. If we print `ar2` out again, it will look exactly how we would expect.

In [17]:
ar2

array([[10,  5,  1,  2],
       [12, 10,  1,  2],
       [11,  9, 10,  1]])
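To make the point about `flatten` concrete, here is a small sketch that modifies the flattened result and then checks the original. The values of `ar2` are typed in by hand from the output above so that the cell does not depend on the random seed; the name `flat` is introduced only for this example.

```python
import numpy as np

# ar2 copied by hand from the output above.
ar2 = np.array([[10, 5, 1, 2],
                [12, 10, 1, 2],
                [11, 9, 10, 1]])

flat = ar2.flatten()  # a new one-dimensional copy of the data
flat[0] = -1          # changing the copy...

print(ar2.shape)      # (3, 4): the original keeps its shape
print(ar2[0, 0])      # 10: ...leaves the original untouched
```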

We are not limited to stacking one-dimensional arrays. We can do the same with multidimensional arrays too. Remember, basically anything that applies to a one- or two-dimensional array will apply to any number of dimensions (given that you are doing your queries correctly).

In [18]:
ar.reshape((3,4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [19]:
np.vstack((ar.reshape((3,4)),ar2))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [10,  5,  1,  2],
       [12, 10,  1,  2],
       [11,  9, 10,  1]])

Now that we have stacked vertically, you might be able to guess what comes next: stacking horizontally. Again this applies to any number of dimensions and shapes that we might have.

In [20]:
np.hstack((ar, ar2.flatten()))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 10,  5,  1,  2, 12,
       10,  1,  2, 11,  9, 10,  1])

In [21]:
np.hstack((ar.reshape((3,4)), ar2))

array([[ 0,  1,  2,  3, 10,  5,  1,  2],
       [ 4,  5,  6,  7, 12, 10,  1,  2],
       [ 8,  9, 10, 11, 11,  9, 10,  1]])

Thus far we have kept it quite simple; however, there is nothing stopping us from concatenating more than two arrays at once.

In [22]:
np.hstack([ar for _ in range(10)])  # pass a list; recent NumPy versions reject a bare generator here

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,
        5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9,
       10, 11,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,
        3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,  5,  6,  7,
        8,  9, 10, 11,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,
        1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,  5,
        6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
       11])

While `vstack` and `hstack` are certainly straightforward, the more general operation is concatenation. There is no separate horizontal or vertical concatenate function; instead we have to start thinking in terms of axes, which are numbered starting from 0. When we do not supply an axis, NumPy defaults to `axis=0`, the first axis. For one-dimensional arrays that is the only axis, so the arrays are simply joined end to end, which matches what `hstack` gave us.

In [23]:
np.concatenate((ar,ar2.flatten()))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 10,  5,  1,  2, 12,
       10,  1,  2, 11,  9, 10,  1])

If we want a vertical stack instead, we might try specifying the axis along which the arrays should be joined. Let's give that a try.

In [24]:
np.concatenate((ar,ar2.flatten()), axis=0)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 10,  5,  1,  2, 12,
       10,  1,  2, 11,  9, 10,  1])

In [25]:
np.concatenate((ar,ar2.flatten()), axis=1)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 10,  5,  1,  2, 12,
       10,  1,  2, 11,  9, 10,  1])

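This is where things get a bit more complicated. With `axis=0` we get the arrays joined end to end, as expected. With `axis=1` we might have hoped for a different arrangement, yet the result is identical. The NumPy version used here only emits a deprecation warning for that call; newer releases raise an error instead, because a one-dimensional array has no axis 1.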

So what is happening here?

The `concatenate` function never changes the number of dimensions of its inputs. If the arrays start out one-dimensional, the result stays one-dimensional, and a one-dimensional array has only axis 0. All input arrays must have the same number of dimensions, and their shapes must match along every axis except the one being joined. If we turn these one-dimensional arrays into two-dimensional ones, they get stacked the way we intended.

In [26]:
np.concatenate(([ar],[ar2.flatten()]), axis=0)

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [10,  5,  1,  2, 12, 10,  1,  2, 11,  9, 10,  1]])

Thinking about the underlying dimensions is crucial to understanding behaviors like this.  It is also a good way to improve your grasp of NumPy.
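Wrapping each array in a list is one way to add the missing dimension. Another, sketched below, is to add it explicitly with `np.newaxis` or `reshape`, which makes the shapes easy to inspect. The array `other` is a made-up stand-in for the second array so that the cell runs on its own.

```python
import numpy as np

ar = np.arange(12)
other = np.arange(12, 24)  # hypothetical second array, same length as ar

# Turn each 1-D array into a single row: shape (12,) becomes (1, 12).
row_a = ar[np.newaxis, :]
row_b = other.reshape((1, -1))

stacked_rows = np.concatenate((row_a, row_b), axis=0)  # rows on top of each other
joined_cols = np.concatenate((row_a, row_b), axis=1)   # rows joined side by side

print(stacked_rows.shape)  # (2, 12)
print(joined_cols.shape)   # (1, 24)
```

Once both inputs are two-dimensional, `axis=0` and `axis=1` give genuinely different results.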

Let's continue on with our multidimensional array.

In [27]:
np.concatenate((ar2,ar2))

array([[10,  5,  1,  2],
       [12, 10,  1,  2],
       [11,  9, 10,  1],
       [10,  5,  1,  2],
       [12, 10,  1,  2],
       [11,  9, 10,  1]])

In [28]:
np.concatenate((ar2,ar2), axis=1)

array([[10,  5,  1,  2, 10,  5,  1,  2],
       [12, 10,  1,  2, 12, 10,  1,  2],
       [11,  9, 10,  1, 11,  9, 10,  1]])
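For two-dimensional inputs, the two axes line up with the stacking functions from earlier. The short check below is a sketch of that correspondence; `ar2` is typed in by hand from the output above, and the names `grid`, `same_vertical`, and `same_horizontal` exist only in this example.

```python
import numpy as np

# ar2 copied by hand from the output above.
ar2 = np.array([[10, 5, 1, 2],
                [12, 10, 1, 2],
                [11, 9, 10, 1]])
grid = np.arange(12).reshape((3, 4))

# For 2-D arrays, axis=0 joins along rows (like vstack) and
# axis=1 joins along columns (like hstack).
same_vertical = np.array_equal(np.vstack((grid, ar2)),
                               np.concatenate((grid, ar2), axis=0))
same_horizontal = np.array_equal(np.hstack((grid, ar2)),
                                 np.concatenate((grid, ar2), axis=1))

print(same_vertical, same_horizontal)  # True True
```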

We should touch on one additional stacking technique: depth stacking. This does not come up too much, so we will be brief, but you should be aware of it. When we want to add a depth dimension to the data, we use `dstack`. For example, applying `dstack` to two two-dimensional arrays is like placing one behind the other along a new third axis. The same can be done with `concatenate` along the third axis (`axis=2`), but as above we would first have to give each array a third dimension.

In [29]:
np.dstack((ar2,ar2))

array([[[10, 10],
        [ 5,  5],
        [ 1,  1],
        [ 2,  2]],

       [[12, 12],
        [10, 10],
        [ 1,  1],
        [ 2,  2]],

       [[11, 11],
        [ 9,  9],
        [10, 10],
        [ 1,  1]]])
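The `concatenate` route mentioned above can be sketched as follows. Each array is given a trailing axis of length one with `np.newaxis` before joining along `axis=2`; `ar2` is typed in by hand from the earlier output, and `layer` and `depth` are names used only here.

```python
import numpy as np

# ar2 copied by hand from the output above.
ar2 = np.array([[10, 5, 1, 2],
                [12, 10, 1, 2],
                [11, 9, 10, 1]])

# Add a third axis of length one: shape (3, 4) becomes (3, 4, 1).
layer = ar2[:, :, np.newaxis]

# Join two such layers along the new third axis.
depth = np.concatenate((layer, layer), axis=2)

print(depth.shape)                                    # (3, 4, 2)
print(np.array_equal(depth, np.dstack((ar2, ar2))))   # True
```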

## Splitting Arrays

So far we have been creators. We have joined plenty of different arrays together in various ways. Now let's explore how we break off chunks of those arrays. We will start with the `dsplit`, which simply reverses what we do in the `dstack` command.

In [30]:
ar3 = np.dstack((ar2,ar2))
np.dsplit(ar3,2)

[array([[[10],
         [ 5],
         [ 1],
         [ 2]],
 
        [[12],
         [10],
         [ 1],
         [ 2]],
 
        [[11],
         [ 9],
         [10],
         [ 1]]]), array([[[10],
         [ 5],
         [ 1],
         [ 2]],
 
        [[12],
         [10],
         [ 1],
         [ 2]],
 
        [[11],
         [ 9],
         [10],
         [ 1]]])]

However, as we can see, it does not quite give us exactly what we put in, because the result is a list of two three-dimensional arrays, each with a depth of one. To get one of them back into its original form we will need the `squeeze` function, which removes axes of length one from an n-dimensional array.

In [31]:
np.dsplit(ar3,2)[0] # equivalent to ar2 in terms of values

array([[[10],
        [ 5],
        [ 1],
        [ 2]],

       [[12],
        [10],
        [ 1],
        [ 2]],

       [[11],
        [ 9],
        [10],
        [ 1]]])

In [32]:
np.squeeze(np.dsplit(ar3,2)[0])

array([[10,  5,  1,  2],
       [12, 10,  1,  2],
       [11,  9, 10,  1]])
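As a quick sanity check of the round trip, the sketch below stacks, splits, and squeezes, then compares the result with the starting array. Passing `axis=2` to `squeeze` is an optional extra that restricts it to the depth axis; `ar2` is typed in by hand from the earlier output, and `front`, `back`, and `restored` are names used only here.

```python
import numpy as np

# ar2 copied by hand from the output above.
ar2 = np.array([[10, 5, 1, 2],
                [12, 10, 1, 2],
                [11, 9, 10, 1]])

ar3 = np.dstack((ar2, ar2))
front, back = np.dsplit(ar3, 2)        # each piece has shape (3, 4, 1)

# Drop only the length-one depth axis.
restored = np.squeeze(front, axis=2)

print(front.shape, restored.shape)     # (3, 4, 1) (3, 4)
print(np.array_equal(restored, ar2))   # True
```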

Don't be too concerned if you do not completely understand the `dsplit` as it does not come up too much. If you are confused when doing these kinds of problems, try to sketch on a piece of paper. Think about the transformations that you want to perform and the end result or shape that you want.

Let's finish up with `hsplit` and `vsplit`. These, as you might have guessed, will split horizontally and vertically.

In [33]:
ar2

array([[10,  5,  1,  2],
       [12, 10,  1,  2],
       [11,  9, 10,  1]])

In [34]:
np.hsplit(ar2,2)

[array([[10,  5],
        [12, 10],
        [11,  9]]), array([[ 1,  2],
        [ 1,  2],
        [10,  1]])]

This has split into two different parts with a vertical line right down the middle, which we can see if we select the first item in the list of arrays that it returns.

In [35]:
np.hsplit(ar2,2)[0]

array([[10,  5],
       [12, 10],
       [11,  9]])

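The `vsplit` function does the same along the rows. However, just like `hsplit`, it requires that the array divide evenly into the requested number of chunks. Our `ar2` has three rows, which cannot be split into two equal parts, so we reshape it first.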

In [36]:
ar2.reshape((4,3))

array([[10,  5,  1],
       [ 2, 12, 10],
       [ 1,  2, 11],
       [ 9, 10,  1]])

In [37]:
np.vsplit(ar2.reshape((4,3)),2)

[array([[10,  5,  1],
        [ 2, 12, 10]]), array([[ 1,  2, 11],
        [ 9, 10,  1]])]

In [38]:
ar4 = ar2.reshape((6,2))
ar4

array([[10,  5],
       [ 1,  2],
       [12, 10],
       [ 1,  2],
       [11,  9],
       [10,  1]])

Again, we do not need to stop at two chunks; we can split into three if the number of rows or columns will let us.

In [39]:
np.vsplit(ar4,3)

[array([[10,  5],
        [ 1,  2]]), array([[12, 10],
        [ 1,  2]]), array([[11,  9],
        [10,  1]])]

In [40]:
np.vsplit(ar4,3)[0]

array([[10,  5],
       [ 1,  2]])

The only requirement is that the number of rows (for `vsplit`) or columns (for `hsplit`) is evenly divisible by the number of chunks.
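The sketch below shows what happens when that requirement is not met. It also uses `np.array_split`, a NumPy function this notebook does not otherwise cover, which accepts uneven divisions. Again, `ar2` is typed in by hand from the earlier output, and `parts` is a name used only here.

```python
import numpy as np

# ar2 copied by hand from the output above.
ar2 = np.array([[10, 5, 1, 2],
                [12, 10, 1, 2],
                [11, 9, 10, 1]])

# Three rows cannot be divided into two equal parts, so vsplit raises.
try:
    np.vsplit(ar2, 2)
except ValueError as err:
    print("vsplit failed:", err)

# array_split tolerates uneven chunks; earlier chunks get the extra rows.
parts = np.array_split(ar2, 2, axis=0)
print([p.shape for p in parts])  # [(2, 4), (1, 4)]
```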

In addition to multidimensional arrays, it is worth mentioning that there is a dedicated matrix type in NumPy. You can certainly look into that, but getting that in-depth is outside the scope of this class. You should at least be aware that you might see that type sometime in your career.

We have covered a lot of NumPy, and although this material has been fairly abstract, many of these concepts will come up in the coming weeks and throughout your time at Berkeley. It is important that you understand them going forward and are comfortable thinking in these ways.