# Querying, Slicing and Combining Arrays

One of the most important things you'll need to do as a data scientist is slice and combine data in different ways. In the next couple of weeks we'll be introduced to a very powerful tool to help us do that. In the meantime, however, we're going to stick to the fundamentals with NumPy.

We've covered boolean selection as a way to filter data. However, there are a lot of other manipulations that can be performed as well.

In [1]:
import sys
print(sys.version)
import numpy as np
print(np.__version__)

3.3.5 |Anaconda 2.2.0 (64-bit)| (default, Sep  2 2014, 13:55:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
1.9.2


In [2]:
np.random.seed(10)

First we're going to generate some arrays to work with: a one-dimensional array of consecutive integers and a two-dimensional array of random integers.

In [3]:
ar = np.arange(12)
ar

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [5]:
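# random_integers is deprecated in newer NumPy; np.random.randint(1, 13, size=12) is the replacement there
ar2 = np.random.random_integers(12, size=12).reshape((3,4))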
ar2

array([[11,  9,  7,  5],
       [ 4,  1,  5, 12],
       [ 7,  9, 12, 11]])

## Querying
Just like with Python lists, a lot of the time we're going to need to query some data. That's pretty straightforward with a one-dimensional array.

In [6]:
ar[5]

5

In [7]:
ar[5:]

array([ 5,  6,  7,  8,  9, 10, 11])

In [8]:
ar[2:8:2]

array([2, 4, 6])

In [9]:
ar[:-1]

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [10]:
ar[-1:-8:-2]

array([11,  9,  7,  5])

Querying two dimensions gets a bit more interesting because we've got another dimension to think about. We do this just like we queried the one-dimensional array, except that we supply one slice per dimension, separated by a comma: rows first, then columns.

In [11]:
ar2

array([[11,  9,  7,  5],
       [ 4,  1,  5, 12],
       [ 7,  9, 12, 11]])

In [12]:
ar2[:,1:3]

array([[ 9,  7],
       [ 1,  5],
       [ 9, 12]])

With the above example, we can see that we got the middle two columns. The `:` before the comma specifies that we want all rows, while the `1:3` after it specifies that we want columns 1 and 2 (the stop index 3 is excluded).

Take a guess at how we might get the last two rows along with all columns.

In [13]:
ar2[1:3,:]

array([[ 4,  1,  5, 12],
       [ 7,  9, 12, 11]])

So we pretty much applied the same command that we did in the first query, but we switched the positioning of the column and row queries. What's cool is that we can also use step increments like we did before. For example, we can get every second row and every second column from the array.

In [15]:
ar2[::2,::2]

array([[11,  7],
       [ 7, 12]])

While we won't be diving too deep into higher dimensions, the same methodology applies to them. If we've got a three-dimensional cube, we just add another comma and write our filter.
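As a small sketch of that idea, the example below builds a three-dimensional array and slices it with one slice per axis. The name `cube` and its shape are made up for illustration and are not part of the arrays used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical 3-D array: 2 blocks, each with 3 rows and 4 columns.
cube = np.arange(24).reshape((2, 3, 4))

# One slice per axis, separated by commas:
# all blocks, the last two rows, and columns 1 and 2.
sub = cube[:, 1:3, 1:3]

print(sub.shape)  # (2, 2, 2)
print(sub)
```

The first position in the brackets now refers to the extra dimension, while the last two still behave like the row and column slices we used on `ar2`.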

## Combining Arrays

Now that we've covered the basics of querying arrays, let's go over combining them. There are lots of ways to combine arrays. NumPy does it a specific way, while other libraries each have their own peculiarities. Let's first explore concatenating, or stacking.

We'll first go with a vertical stack.

In [16]:
ar

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [22]:
ar2.flatten()

array([11,  9,  7,  5,  4,  1,  5, 12,  7,  9, 12, 11])

In [21]:
np.vstack((ar,ar2.flatten()))

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [11,  9,  7,  5,  4,  1,  5, 12,  7,  9, 12, 11]])

We can see above that we just concatenated them vertically, which is pretty straightforward. One thing to keep in mind: our array `ar2` isn't actually modified. `flatten` just returned a flattened copy. If we print `ar2` out again, it'll look exactly like we expect it to.

In [24]:
ar2

array([[11,  9,  7,  5],
       [ 4,  1,  5, 12],
       [ 7,  9, 12, 11]])
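To make the "returns a copy" point concrete, here is a minimal sketch using a throwaway array (the names `grid` and `flat` are just for illustration). It changes the flattened result and checks that the original is untouched.

```python
import numpy as np

grid = np.arange(6).reshape((2, 3))
flat = grid.flatten()    # flatten() returns a new 1-D copy of the data

flat[0] = 99             # change the copy...
print(grid[0, 0])        # ...and the original still holds 0
print(grid.shape, flat.shape)  # (2, 3) (6,)
```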

Now we don't just have to stack single dimension arrays. We can do the same with multi-dimensional ones too. Remember, basically anything that applies to a one or two dimensional array will apply to any number of dimensions (given that you're doing your queries correctly).

In [25]:
ar.reshape((3,4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [26]:
np.vstack((ar.reshape((3,4)),ar2))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [11,  9,  7,  5],
       [ 4,  1,  5, 12],
       [ 7,  9, 12, 11]])

Now that we've stacked vertically, you might be able to guess what is next: stacking horizontally. Again, this applies to any number of dimensions, as long as the shapes are compatible.

In [28]:
np.hstack((ar, ar2.flatten()))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 11,  9,  7,  5,  4,
        1,  5, 12,  7,  9, 12, 11])

In [29]:
np.hstack((ar.reshape((3,4)), ar2))

array([[ 0,  1,  2,  3, 11,  9,  7,  5],
       [ 4,  5,  6,  7,  4,  1,  5, 12],
       [ 8,  9, 10, 11,  7,  9, 12, 11]])

Thus far we've been keeping it pretty simple. However, there's nothing stopping us from concatenating more than two arrays.

In [33]:
# pass a list of arrays; newer NumPy versions no longer accept a generator here
np.hstack([ar for x in range(10)])

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,
        5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9,
       10, 11,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,
        3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,  5,  6,  7,
        8,  9, 10, 11,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,
        1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,  5,
        6,  7,  8,  9, 10, 11,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
       11])

Now while `vstack` and `hstack` are certainly straightforward, the more general tool is `concatenate`. We don't have a nice horizontal or vertical keyword here; instead we have to start thinking in terms of axes, which are numbered. When we don't supply an axis, NumPy defaults to `axis=0`, the first axis. For our one-dimensional arrays that is the only axis there is, so the arrays are joined end to end, just like the `hstack` result above.

In [34]:
np.concatenate((ar,ar2.flatten()))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 11,  9,  7,  5,  4,
        1,  5, 12,  7,  9, 12, 11])

If we want to make that a vertical stack, we've got to change things up a bit by specifying the axis on which it should join. So let's go ahead and give it a try.

In [38]:
np.concatenate((ar,ar2.flatten()), axis=0)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 11,  9,  7,  5,  4,
        1,  5, 12,  7,  9, 12, 11])

In [43]:
np.concatenate((ar,ar2.flatten()), axis=1)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 11,  9,  7,  5,  4,
        1,  5, 12,  7,  9, 12, 11])

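This is where things get a bit more complicated. With `axis=0` we expected to get back the same array that `hstack` gave us, and we did. With `axis=1` we were hoping for a vertical stack, but we got exactly the same flat array. (This output comes from the NumPy 1.9.2 used here, which ignores the out-of-range axis for one-dimensional input; newer NumPy releases raise an error for this call instead.)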

So what's happening here?

Basically, `concatenate` won't change the number of dimensions of its inputs. If they start out in one dimension, the result stays that way, and a one-dimensional array has no axis 1 to join along. All input arrays must also have the same number of dimensions. If we make these one-dimensional arrays two-dimensional, they'll get stacked the way we want.

In [49]:
np.concatenate(([ar],[ar2.flatten()]), axis=0)

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [11,  9,  7,  5,  4,  1,  5, 12,  7,  9, 12, 11]])
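Wrapping each array in a Python list is one way to add the missing dimension. Another option, sketched here with two small made-up arrays `a` and `b`, is to add the axis explicitly with `np.newaxis`, which makes the intended shape a little more visible.

```python
import numpy as np

a = np.arange(4)        # shape (4,)
b = np.arange(4, 8)     # shape (4,)

# Promote each 1-D array to shape (1, 4) so there is a row axis to join along.
stacked = np.concatenate((a[np.newaxis, :], b[np.newaxis, :]), axis=0)
print(stacked.shape)    # (2, 4)

# For these inputs the result matches what vstack builds for us.
same = np.array_equal(stacked, np.vstack((a, b)))
print(same)             # True
```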

Thinking about the underlying type(s) is a great way to 'level up' in programming. It's really the key to understanding what is going on and when.

Let's continue on with our multidimensional array.

In [50]:
np.concatenate((ar2,ar2))

array([[11,  9,  7,  5],
       [ 4,  1,  5, 12],
       [ 7,  9, 12, 11],
       [11,  9,  7,  5],
       [ 4,  1,  5, 12],
       [ 7,  9, 12, 11]])

In [51]:
np.concatenate((ar2,ar2), axis=1)

array([[11,  9,  7,  5, 11,  9,  7,  5],
       [ 4,  1,  5, 12,  4,  1,  5, 12],
       [ 7,  9, 12, 11,  7,  9, 12, 11]])
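For two-dimensional input, the axis numbers line up with the stacking helpers from earlier. The sketch below uses a small illustrative array `m` (not one of the notebook's arrays) to check the shapes and compare against `vstack` and `hstack`.

```python
import numpy as np

m = np.arange(6).reshape((2, 3))

rows = np.concatenate((m, m), axis=0)   # join along rows: more rows
cols = np.concatenate((m, m), axis=1)   # join along columns: more columns

print(rows.shape)   # (4, 3)
print(cols.shape)   # (2, 6)

# For 2-D arrays, axis=0 behaves like vstack and axis=1 like hstack.
print(np.array_equal(rows, np.vstack((m, m))))   # True
print(np.array_equal(cols, np.hstack((m, m))))   # True
```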

There's one last stacking technique that we should touch on, called depth stacking. This doesn't come up too much, so we'll be brief, but you should be aware of it. Basically, when we want to add a depth dimension to our data, we use `dstack`. For example, calling `dstack` on two two-dimensional arrays is like putting one behind the other along a new third axis. The same can be done with `concatenate` along the third axis (but, like above, we'd have to make sure the arrays already had that third dimension).

In [56]:
np.dstack((ar2,ar2))

array([[[11, 11],
        [ 9,  9],
        [ 7,  7],
        [ 5,  5]],

       [[ 4,  4],
        [ 1,  1],
        [ 5,  5],
        [12, 12]],

       [[ 7,  7],
        [ 9,  9],
        [12, 12],
        [11, 11]]])
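Here is a rough sketch of the `concatenate` route mentioned above, using a small made-up array `m`. Each input first gets a trailing axis of length one via `np.newaxis`, and then the arrays are joined along `axis=2`.

```python
import numpy as np

m = np.arange(6).reshape((2, 3))

# Give each 2-D array a third axis of length 1, then join along that axis.
deep = np.concatenate((m[:, :, np.newaxis], m[:, :, np.newaxis]), axis=2)

print(deep.shape)                                 # (2, 3, 2)
print(np.array_equal(deep, np.dstack((m, m))))    # True
```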

## Splitting Arrays

So far we've been creators. We've joined plenty of different arrays together in various ways. Now let's explore how we break off chunks of those arrays. We'll start with the `dsplit` which just reverses what we do in the `dstack` command.

In [66]:
ar3 = np.dstack((ar2,ar2))
np.dsplit(ar3,2)

[array([[[11],
         [ 9],
         [ 7],
         [ 5]],
 
        [[ 4],
         [ 1],
         [ 5],
         [12]],
 
        [[ 7],
         [ 9],
         [12],
         [11]]]), array([[[11],
         [ 9],
         [ 7],
         [ 5]],
 
        [[ 4],
         [ 1],
         [ 5],
         [12]],
 
        [[ 7],
         [ 9],
         [12],
         [11]]])]

However, as we can see, it doesn't quite give us exactly what we put in. That's because it's now a list of two three-dimensional arrays, each with shape `(3, 4, 1)`. To get one back into its original form we're going to have to use the `squeeze` function, which removes any axes of length one from an n-dimensional array.

In [69]:
np.dsplit(ar3,2)[0] # equivalent to ar2 in terms of values

array([[[11],
        [ 9],
        [ 7],
        [ 5]],

       [[ 4],
        [ 1],
        [ 5],
        [12]],

       [[ 7],
        [ 9],
        [12],
        [11]]])

In [70]:
np.squeeze(np.dsplit(ar3,2)[0])

array([[11,  9,  7,  5],
       [ 4,  1,  5, 12],
       [ 7,  9, 12, 11]])
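A quick sketch of how `squeeze` behaves, using illustrative arrays `x` and `y`: it only drops axes whose length is one, and the optional `axis` argument lets us name the axis we expect to remove.

```python
import numpy as np

x = np.arange(12).reshape((3, 4, 1))   # trailing axis of length 1

print(np.squeeze(x).shape)             # (3, 4)
print(np.squeeze(x, axis=2).shape)     # (3, 4), naming the axis explicitly

y = np.arange(12).reshape((3, 4))      # no length-1 axes to remove
print(np.squeeze(y).shape)             # (3, 4), unchanged
```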

Now if you don't completely understand `dsplit`, that's alright; to be honest, it doesn't come up too much. If you're confused when doing these kinds of problems, one of the things that can really help is to break out a piece of paper and do some sketching. Think about the transformations that you want to perform: what is the end result or shape that you want?

Moving right along, let's finish up with `hsplit` and `vsplit`. These, as you might have guessed, will split horizontally and vertically.

In [77]:
ar2

array([[11,  9,  7,  5],
       [ 4,  1,  5, 12],
       [ 7,  9, 12, 11]])

In [76]:
np.hsplit(ar2,2)

[array([[11,  9],
        [ 4,  1],
        [ 7,  9]]), array([[ 7,  5],
        [ 5, 12],
        [12, 11]])]

What this has done is split the array with a vertical line right down the middle, giving us two parts. We can see this if we select the first item in the list of arrays that it returns.

In [80]:
np.hsplit(ar2,2)[0]

array([[11,  9],
       [ 4,  1],
       [ 7,  9]])

`vsplit` does the same thing along the rows. However, just like `hsplit`, it requires that the array divide evenly into the number of "chunks" we ask for.

In [81]:
ar2.reshape((4,3))

array([[11,  9,  7],
       [ 5,  4,  1],
       [ 5, 12,  7],
       [ 9, 12, 11]])

In [82]:
np.vsplit(ar2.reshape((4,3)),2)

[array([[11,  9,  7],
        [ 5,  4,  1]]), array([[ 5, 12,  7],
        [ 9, 12, 11]])]

In [84]:
ar4 = ar2.reshape((6,2))
ar4

array([[11,  9],
       [ 7,  5],
       [ 4,  1],
       [ 5, 12],
       [ 7,  9],
       [12, 11]])

Again, we don't need to stop at 2 chunks, we can split into 3 if the number of rows or columns will let us.

In [85]:
np.vsplit(ar4,3)

[array([[11,  9],
        [ 7,  5]]), array([[ 4,  1],
        [ 5, 12]]), array([[ 7,  9],
        [12, 11]])]

In [86]:
np.vsplit(ar4,3)[0]

array([[11,  9],
       [ 7,  5]])

The only requirement is that the array can be split evenly into that many chunks.
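When the rows don't divide evenly, there are a couple of ways around it. This is a minimal sketch with a made-up five-row array `rows`: `np.array_split` allows chunks of unequal size, and `vsplit` also accepts a list of row indices to split at instead of a chunk count.

```python
import numpy as np

rows = np.arange(10).reshape((5, 2))   # 5 rows cannot be halved evenly

# Asking vsplit for 2 equal chunks fails with a ValueError.
try:
    np.vsplit(rows, 2)
    even_split_ok = True
except ValueError:
    even_split_ok = False
print(even_split_ok)                   # False

# array_split tolerates uneven chunks.
parts = np.array_split(rows, 2, axis=0)
print([p.shape for p in parts])        # [(3, 2), (2, 2)]

# vsplit with a list splits at the given row indices instead.
head, tail = np.vsplit(rows, [1])
print(head.shape, tail.shape)          # (1, 2) (4, 2)
```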

That's all that I wanted to cover. I will mention that there is a dedicated matrix type in NumPy. You can certainly look into that if you like, but diving that deep is outside the scope of this class. Just be aware that you might see that type sometime in your career.

Now we've covered a ton of NumPy, and we're sure you're thinking that this has been very unapplied. However, a lot of these concepts will come up in the coming weeks and throughout your time at Berkeley. It's really important that you understand these concepts going forward and are comfortable thinking in these ways.