# Numpy Arrays

We'll import the `numpy` module using the short form `np` to save us typing. 

In [3]:
import numpy as np # imports a fast numerical programming library

![](numpy_arrays.slides.dir/2.png)

## Starting up with numpy arrays

Scientific Python code uses a fast array structure, called the numpy array. Those who have worked in Matlab will find this very natural. For reference, the numpy documentation can be found [here](https://docs.scipy.org/doc/numpy/reference/).

Let's make a numpy array.

In [4]:
my_array = np.array([1, 2, 3, 4])
my_array

array([1, 2, 3, 4])

Numpy arrays are listy! Below we compute length, slice, and iterate. 

In [5]:
print(len(my_array))
print(my_array[2:4])
for ele in my_array:
    print(ele)

4
[3 4]
1
2
3
4


**However, in general you should manipulate numpy arrays by using numpy module functions** (`np.mean`, for example). This is for efficiency; see the Vanderplas book for details. But briefly:

1. numpy arrays are typed: you keep ints together or floats together. They are not meant to combine objects of different types the way python lists do.
2. numpy arrays are implemented in C and store their elements as raw C values in one contiguous block of memory. numpy functions operate on that block directly, so there is no per-element conversion between python floats and C floats. A python float, by contrast, is a full python object that carries type and reference-counting information alongside its value.
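
As a small illustration of the first point, the sketch below shows what happens when mixed types meet a typed array. The variable names are made up for this example and are not part of the original notebook.

```python
import numpy as np

# A python list can hold mixed types; each element is a separate python object.
mixed_list = [1, 2.5, 3]

# Converting to an array forces a single dtype: the ints are upcast to float64.
mixed_array = np.array(mixed_list)
print(mixed_array)        # [1.  2.5 3. ]
print(mixed_array.dtype)  # float64

# Assigning a float into an integer array truncates it to fit the array's dtype.
int_array = np.array([1, 2, 3])
int_array[0] = 9.7
print(int_array)          # [9 2 3]
```

The silent truncation in the last step is a common source of bugs, which is one reason to check `dtype` when results look off.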

You can calculate the mean of the array elements either by calling the method `.mean` on a numpy array or by applying the function `np.mean` with the numpy array as an argument.

In [6]:
print(my_array.mean())
print(np.mean(my_array))

2.5
2.5


The way we constructed the numpy array above seems redundant. After all, we already had a regular python list. Indeed, it is the other ways we have to construct numpy arrays that make them super useful.

There are many such numpy array *constructors*. Here are some commonly used constructors. Look them up in the documentation.

In [7]:
np.ones(10) # generates 10 floating point ones

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

**Numpy gains a lot of its efficiency from being typed**. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a 64-bit float (`float64`, the equivalent of a C `double`).

In [8]:
np.dtype(float).itemsize # in bytes

8

In [9]:
np.ones(10, dtype='int') # generates 10 integer ones

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [10]:
np.ones(10).dtype

dtype('float64')
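
Beyond `np.ones`, a few other constructors come up constantly. The sketch below is not an exhaustive list, just four that are likely to be useful early on; the variable names are chosen for this example.

```python
import numpy as np

zeros = np.zeros(5)              # five floating point zeros
counts = np.arange(0, 10, 2)     # integers from 0 up to (not including) 10, step 2
grid = np.linspace(0.0, 1.0, 5)  # five evenly spaced floats from 0 to 1 inclusive
identity = np.eye(3)             # 3 x 3 identity matrix

print(zeros)           # [0. 0. 0. 0. 0.]
print(counts)          # [0 2 4 6 8]
print(grid)            # [0.   0.25 0.5  0.75 1.  ]
print(identity.shape)  # (3, 3)
```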

## Numpy supports vector operations

What does this mean? It means that instead of looping over two arrays and adding them element by element, you can just say: add the two arrays. Note that this behavior is very different from python lists.

![](numpy_arrays.slides.dir/3.png)

In [11]:
first = np.ones(5)
second = np.ones(5)
first + second

array([2., 2., 2., 2., 2.])

In [12]:
first_list = [1., 1., 1., 1., 1.]
second_list = [1., 1., 1., 1., 1.]
first_list + second_list # concatenates the lists, which is not what you want

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

On some computer chips this addition actually happens in parallel, so speedups can be high. But even on regular chips, the advantage of greater readability is important.
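
For comparison, here is a sketch of what element-wise addition looks like with plain lists next to the array version. The names `summed_list` and `summed_array` are introduced only for this example.

```python
import numpy as np

first_list = [1., 1., 1., 1., 1.]
second_list = [1., 1., 1., 1., 1.]

# With plain lists, element-wise addition needs an explicit loop or comprehension.
summed_list = [a + b for a, b in zip(first_list, second_list)]
print(summed_list)   # [2.0, 2.0, 2.0, 2.0, 2.0]

# With arrays, the same computation is a single expression.
summed_array = np.array(first_list) + np.array(second_list)
print(summed_array)  # [2. 2. 2. 2. 2.]
```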

Numpy supports a concept known as *broadcasting*, which dictates how arrays of different sizes are combined together. There are too many rules to list here, but importantly, multiplying an array by a number multiplies each element by the number. Adding a number adds the number to each element.

In [13]:
first + 1

array([2., 2., 2., 2., 2.])

In [14]:
first*5

array([5., 5., 5., 5., 5.])
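
Broadcasting also applies between arrays of different shapes, not only between an array and a number. The sketch below uses small made-up arrays (`matrix`, `row`, `column`) to show the two most common cases.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])     # shape (2, 3)
row = np.array([10, 20, 30])       # shape (3,)
column = np.array([[100], [200]])  # shape (2, 1)

# The row is added to every row of the matrix.
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]

# The column is added to every column of the matrix.
print(matrix + column)
# [[101 102 103]
#  [204 205 206]]
```

In both cases the smaller array is treated as if it were repeated to match the larger one, without actually copying the data.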

## 2D arrays

Similarly, we can create two-dimensional arrays.

![](numpy_arrays.slides.dir/6.png)

In [15]:
my_array2d = np.array([ [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12] ])
my_array2d

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [16]:
# 3 x 4 array of ones
ones_2d = np.ones([3, 4])
ones_2d

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

Like lists, numpy arrays are 0-indexed.  Thus we can access the $n$th row and the $m$th column of a two-dimensional array with the indices $[n - 1, m - 1]$.

In [17]:
print(my_array2d)
my_array2d[2, 3]

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


12

The 2D arrays are listy as well. They have a set length (given by the array dimensions), can be sliced, and can be iterated over with a loop. Below is a schematic illustrating slicing two-dimensional arrays.

 <img src="images/2dindex_v2.png" alt="Drawing" style="width: 500px;"/>
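
A few slicing patterns on the `my_array2d` defined earlier, written out as a short sketch (the array is rebuilt here so the snippet runs on its own):

```python
import numpy as np

my_array2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(len(my_array2d))       # 3, the number of rows
print(my_array2d[1])         # second row: [5 6 7 8]
print(my_array2d[:, 0])      # first column: [1 5 9]
print(my_array2d[0:2, 1:3])  # rows 0-1, columns 1-2
# [[2 3]
#  [6 7]]

for row in my_array2d:       # iterating yields one row at a time
    print(row.sum())         # 10, then 26, then 42
```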

In two dimensions, we need to provide the **shape** of the array, i.e., the number of rows and columns of the array.

In [18]:
onesarray = np.ones([3,4])
onesarray

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [19]:
onesarray.shape

(3, 4)

Numpy functions will by default work on the entire array:

In [20]:
np.sum(onesarray)

12.0

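The axis 0 is the one going downwards (the $y$-axis, so to speak), whereas axis 1 is the one going across (the $x$-axis). You will often use functions such as `mean` and `sum` with an `axis` argument. Summing along axis 0 collapses the rows and returns one value per column; summing along axis 1 returns one value per row.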

In [21]:
np.sum(onesarray, axis=0)

array([3., 3., 3., 3.])

In [22]:
np.sum(my_array2d, axis=0)

array([15, 18, 21, 24])

In [23]:
np.sum(onesarray, axis=1)

array([4., 4., 4.])
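
The same `axis` argument works for `mean`. A brief sketch on `my_array2d` (rebuilt here so the snippet is self-contained; `column_means` and `row_means` are names chosen for this example):

```python
import numpy as np

my_array2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

column_means = my_array2d.mean(axis=0)   # collapse the rows: one value per column
row_means = np.mean(my_array2d, axis=1)  # collapse the columns: one value per row

print(column_means)  # [5. 6. 7. 8.]
print(row_means)     # [ 2.5  6.5 10.5]
```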

You should notice that access is row-by-row and one-dimensional iteration gives a row. This is because `numpy` lays out memory row-wise by default.

![](images/2d-array-layout.png)

(from https://aaronbloomfield.github.io)
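
One way to see the row-wise layout from Python is to inspect an array's flags and strides. This is a sketch for illustration; it compares strides against `itemsize` instead of hard-coding byte counts, since the default integer size can differ between platforms.

```python
import numpy as np

my_array2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print(my_array2d.flags['C_CONTIGUOUS'])  # True: rows are stored one after another
print(my_array2d.ravel())                # [ 1  2  3  4  5  6  7  8  9 10 11 12]

# Strides give the number of bytes to step to reach the next row and the next column.
itemsize = my_array2d.itemsize
print(my_array2d.strides == (4 * itemsize, itemsize))  # True
```

Moving to the next row skips four elements in memory, while moving to the next column skips one, which is exactly what a row-wise layout implies.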

An often seen idiom allocates a two-dimensional array, and then fills in one-dimensional arrays from some function:

In [25]:
empty_array = np.empty((2,3)) # allocates without initializing; the contents are arbitrary (they happen to be zeros here)
empty_array

array([[0., 0., 0.],
       [0., 0., 0.]])

In [28]:
for i in range(empty_array.shape[0]):
    print(empty_array[i])

[0. 0. 0.]
[0. 0. 0.]


In [29]:
for i in range(empty_array.shape[0]):
    empty_array[i] = np.random.rand(3)
empty_array

array([[0.68625855, 0.09451053, 0.9559449 ],
       [0.03386545, 0.08089574, 0.84962253]])
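
The allocate-then-fill idiom can be sketched with an explicit row-producing function. Here `make_row` is a hypothetical stand-in for whatever function generates each row, and the seeded generator from `np.random.default_rng` is an assumption made so that runs are repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded generator so runs are repeatable

def make_row(n):
    """Hypothetical row-producing function; here it just draws n uniform samples."""
    return rng.random(n)

n_rows, n_cols = 2, 3
filled = np.empty((n_rows, n_cols))  # contents are arbitrary until assigned
for i in range(n_rows):
    filled[i] = make_row(n_cols)     # overwrite row i in place

# When all rows come from the same distribution, a single call does the same job.
direct = rng.random((n_rows, n_cols))

print(filled.shape, direct.shape)    # (2, 3) (2, 3)
```

The loop form is worth keeping when each row depends on something different (a file, a parameter, a previous row); otherwise the single vectorized call is shorter and usually faster.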

## Pandas and numpy

You can very easily convert from Pandas to numpy. Just use the `.values` attribute on a pandas dataframe or series.

In [30]:
import pandas as pd
combined = pd.read_csv('data/combined_population_votes.csv')
combined.head()

| Unnamed: 0 | State | Population | Votes | popmills |
|---|---|---|---|---|
| 0 | Alaska | 710000 | 3 | 0.71 |
| 1 | Alabama | 4780000 | 9 | 4.78 |
| 2 | Arkansas | 2916000 | 6 | 2.916 |
| 3 | Arizona | 6392000 | 11 | 6.392 |
| 4 | California | 37254000 | 55 | 37.254 |


For a Series object:

In [31]:
combined.Votes.values

array([ 3,  9,  6, 11, 55,  9,  7,  3,  3, 29, 16,  4,  6,  4, 20, 11,  6,
        8,  8, 11, 10,  4, 16, 10, 10,  6,  3, 15,  3,  5,  4, 14,  5,  6,
       29, 18,  7,  7, 20,  4,  9,  3, 11, 38,  6, 13,  3, 12, 10,  5,  3])

For a DataFrame:

In [32]:
combined[['popmills', 'Votes']].values

array([[ 0.71 ,  3.   ],
       [ 4.78 ,  9.   ],
       [ 2.916,  6.   ],
       [ 6.392, 11.   ],
       [37.254, 55.   ],
       [ 5.029,  9.   ],
       [ 3.574,  7.   ],
       [ 0.602,  3.   ],
       [ 0.898,  3.   ],
       [18.801, 29.   ],
       [ 9.688, 16.   ],
       [ 1.36 ,  4.   ],
       [ 3.046,  6.   ],
       [ 1.568,  4.   ],
       [12.831, 20.   ],
       [ 6.484, 11.   ],
       [ 2.853,  6.   ],
       [ 4.339,  8.   ],
       [ 4.533,  8.   ],
       [ 6.548, 11.   ],
       [ 5.774, 10.   ],
       [ 1.328,  4.   ],
       [ 9.884, 16.   ],
       [ 5.304, 10.   ],
       [ 5.989, 10.   ],
       [ 2.967,  6.   ],
       [ 0.989,  3.   ],
       [ 9.535, 15.   ],
       [ 0.673,  3.   ],
       [ 1.826,  5.   ],
       [ 1.316,  4.   ],
       [ 8.792, 14.   ],
       [ 2.059,  5.   ],
       [ 2.701,  6.   ],
       [19.378, 29.   ],
       [11.537, 18.   ],
       [ 3.751,  7.   ],
       [ 3.831,  7.   ],
       [12.702, 20.   ],
       [ 1.053,  4.   ],


Using `values` then gives us a two-dimensional numpy array (notice how the ints are cast to floats so that every element has the same data type).
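
The upcasting can be checked directly on a small dataframe. The sketch below builds a three-row stand-in (called `small`, a name invented here) from the first rows shown in `combined.head()`, and uses `.to_numpy()`, which recent pandas versions offer alongside `.values`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first rows of the `combined` dataframe.
small = pd.DataFrame({
    "State": ["Alaska", "Alabama", "Arkansas"],
    "Votes": [3, 9, 6],
    "popmills": [0.71, 4.78, 2.916],
})

votes = small["Votes"].to_numpy()               # a single integer column stays integer
both = small[["popmills", "Votes"]].to_numpy()  # mixed numeric columns are upcast to float

print(votes.dtype.kind)  # i
print(both.dtype)        # float64
print(both.shape)        # (3, 2)
```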