# Introduction to NumPy

NumPy is the fundamental package for scientific computing with Python. It contains among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

NumPy is the basis for the scientific computing and data science stacks in Python, including SciPy and Pandas.

![NumPy and SciPy](numpy-and-scipy.png)

Website: [http://numpy.org](http://numpy.org/)
GitHub repository: [https://github.com/numpy/numpy](https://github.com/numpy/numpy)

NumPy 1.0 was released in October 2006. At the time of writing, the latest release was 1.12.1 (March 18, 2017).


## Why not just use regular Python?

After all, Python can do math with integers and floating-point numbers, has built-in `math` and `array` modules, and has fast and flexible container types:

- Lists: cheap insert and append methods, can hold any type of data
- Dictionaries: Mapping type with very fast lookup operations

However, NumPy has an *extremely* efficient n-dimensional array implementation, and it has vectorized versions of the main Python math operations, implemented in compiled code, which operate very quickly on whole arrays at once. This makes it very well suited to the kinds of vector and matrix operations that the SciPy stack builds on.

Most of the time you won't be using NumPy directly but it is still good to understand how its core pieces work.

## The basic type: `ndarray`

The n-dimensional array (`ndarray`) is the fundamental data structure in NumPy. It is a table of elements, all of the same type, which can be indexed (that is, accessed) by a tuple of integers.


In [2]:
import numpy as np

# Each element must be of the same type.
intarray = np.array([1, 2, 3, 4])
intarray.dtype

dtype('int32')

In [3]:
floatarray = np.array([1.0, 2.0, 3.0, 4.0])
floatarray.dtype

dtype('float64')

In [4]:
# NumPy will try to coerce elements to the same datatype:
mixedarray = np.array([1, 2.0, 3, np.pi])
mixedarray.dtype

dtype('float64')

In [5]:
# If you force it to coerce too much you end up with "object" as a dtype, which is not useful. Stick to numbers.
dont_do_this = np.array([1, False, "a string", object()])
dont_do_this.dtype

dtype('O')

In [6]:
# You can specify the dtype when creating the ndarray; otherwise np.array infers it from the data
# (integers here would give an integer dtype, so we ask for float64 explicitly).
floats = np.array([0, 1, 2], dtype="float64")
floats.dtype

dtype('float64')
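
To make the inference and conversion rules above concrete, here is a minimal sketch. The variable names are illustrative only, and the exact integer width that NumPy infers depends on the platform, so the example checks only the dtype *kind* for integers.

```python
import numpy as np

# With no dtype given, np.array infers one from the input values.
ints = np.array([0, 1, 2])
floats = np.array([0.0, 1.0, 2.0])
print(ints.dtype.kind)    # 'i' (a signed integer; the width is platform-dependent)
print(floats.dtype)       # float64

# astype returns a new, converted array and leaves the original untouched.
as_floats = ints.astype(np.float64)
print(as_floats.dtype)    # float64
print(ints.dtype.kind)    # still 'i'

# Converting floats to integers truncates toward zero rather than rounding.
truncated = np.array([1.9, -1.9]).astype(np.int64)
print(truncated.tolist())  # [1, -1]
```

If you need rounding instead of truncation, apply `np.round` before calling `astype`.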

## Arrays have shapes and sizes

The "nd" in `ndarray` means arrays can have any number of *dimensions*. An array has the following properties:
- `ndim`: the number of axes. Also known as the *rank* of the array.
- `shape`: the dimensions of the array. A tuple of integers indicating the size of the array in each dimension.
- `size`: the number of elements in the array.


In [7]:
def describe_array(name, a):
    print(name)
    print(a)
    print("Array ndim: ", a.ndim)
    print("Array shape: ", a.shape)
    print("Array size: ", a.size)
    print()


describe_array("1-D array, 3 elements", np.array([1, 2, 3]))
describe_array("2 rows, 3 columns", np.array([[1, 2, 3], [4, 5, 6]]))
describe_array("3 rows, 2 columns", np.array([[1, 2], [3, 4], [5, 6]]))
describe_array("2 faces, 3 rows, 1 column", np.array([[[1], [2], [3]], [[4], [5], [6]]]))

1-D array, 3 elements
[1 2 3]
Array ndim:  1
Array shape:  (3,)
Array size:  3

2 rows, 3 columns
[[1 2 3]
 [4 5 6]]
Array ndim:  2
Array shape:  (2, 3)
Array size:  6

3 rows, 2 columns
[[1 2]
 [3 4]
 [5 6]]
Array ndim:  2
Array shape:  (3, 2)
Array size:  6

2 faces, 3 rows, 1 column
[[[1]
  [2]
  [3]]

 [[4]
  [5]
  [6]]]
Array ndim:  3
Array shape:  (2, 3, 1)
Array size:  6
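
The three properties are tied together: `ndim` is the length of `shape`, and `size` is the product of the entries in `shape`. The sketch below checks this with `reshape`, which rearranges the same elements into a different shape; the array names are illustrative.

```python
import numpy as np

a = np.arange(6)          # shape (6,): one axis
b = a.reshape(2, 3)       # the same 6 elements on two axes
c = a.reshape(2, 3, 1)    # the same 6 elements on three axes

for arr in (a, b, c):
    # ndim is the length of the shape tuple
    assert arr.ndim == len(arr.shape)
    # size is the product of the entries in shape
    assert arr.size == int(np.prod(arr.shape))

# Passing -1 asks reshape to work out that dimension from the others.
d = a.reshape(3, -1)
print(d.shape)   # (3, 2)
```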



## Creating Arrays

There are several convenience functions for creating arrays.

- `np.arange`: Create an array from a range of numbers
- `np.zeros`: An array filled with zeroes
- `np.ones`: An array filled with ones
- `np.eye`: An identity matrix
- `np.random.*`: Arrays filled with random values from various distributions


In [8]:
# np.arange([start,] stop[, step]); `stop` itself is not included
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [9]:
np.arange(100, 90, -1)

array([100,  99,  98,  97,  96,  95,  94,  93,  92,  91])

In [10]:
# np.zeros(shape) where `shape` is a tuple.
# e.g. 1 row, 3 columns
np.zeros((1, 3))

array([[ 0.,  0.,  0.]])

In [11]:
# or 2 faces, 5 rows, 2 columns
np.zeros((2, 5, 2))

array([[[ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.]],

       [[ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.]]])

In [12]:
# np.eye(N, M=None, k=0): N rows, M columns (defaults to N), k is the offset of the diagonal
np.eye(4)

array([[ 1.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  1.]])

In [13]:
# np.random.randint(low, high, size): integers from low (inclusive) up to high (exclusive)
np.random.randint(0, 10, (3, 3))

array([[6, 2, 5],
       [5, 8, 5],
       [9, 2, 7]])

In [14]:
# np.random.normal(loc, scale, size): loc is the mean, scale is the standard deviation
np.random.normal(0, 1, (4, 4))

array([[ 0.18704042,  0.06793928, -0.74054112, -2.04129821],
       [-2.26942806, -0.99240057,  0.40324904,  0.29846174],
       [-2.90820995, -0.93991624,  0.81164193,  1.00287833],
       [-0.89818173, -0.28382557, -1.34148789,  1.37263137]])
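
The list above also mentions `np.ones`, and `np.eye` accepts a diagonal offset `k`; neither is shown in the cells above. The following is a small sketch of both, plus one way to make the random examples repeatable by seeding the legacy global generator with `np.random.seed` (the seed value 0 is an arbitrary choice).

```python
import numpy as np

# np.ones(shape) works like np.zeros; the default dtype is float64.
ones = np.ones((2, 3))
int_ones = np.ones((2, 3), dtype=int)
print(ones.dtype, int_ones.dtype.kind)   # float64 i

# k shifts the diagonal of ones: k=1 is the first diagonal above the main one.
shifted = np.eye(3, k=1)
print(shifted.tolist())
# [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

# Re-seeding with the same value reproduces the same "random" array.
np.random.seed(0)
first = np.random.randint(0, 10, (3, 3))
np.random.seed(0)
second = np.random.randint(0, 10, (3, 3))
print(np.array_equal(first, second))   # True
```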

## Simple Arithmetic Operations on Arrays

Built-in Python arithmetic operators work element-wise on arrays, as you would expect.

NumPy also includes its own versions of many math operations as what it calls `ufunc`s, or universal functions. These offer high-speed, vectorized versions of basic operations.


In [15]:
a = np.arange(0, 10)
print(a)

[0 1 2 3 4 5 6 7 8 9]


In [16]:
a + 10

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [17]:
a * 2

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
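
As a brief illustration of how the operators relate to ufuncs, the sketch below checks that `+` and `*` on arrays give the same results as `np.add` and `np.multiply`, and shows an operation between two arrays of the same shape. The second array `b` is introduced here only for the example.

```python
import numpy as np

a = np.arange(0, 10)

# The arithmetic operators on arrays produce the same results as the matching ufuncs.
assert np.array_equal(a + 10, np.add(a, 10))
assert np.array_equal(a * 2, np.multiply(a, 2))

# Between two arrays of the same shape, elements are paired up by position.
b = np.arange(10, 20)
total = a + b
print(total)   # [10 12 14 16 18 20 22 24 26 28]

# ufuncs can also reduce an array to a single value.
print(np.add.reduce(a))   # 45, the same as a.sum()
```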

### List vs. Array Performance

It's easy to show that for numeric operations, NumPy arrays vastly outperform lists.

In [59]:
# Let's create a python list and an array, each with 10 million integers.
big_list = list(range(10000000))
big_array = np.arange(10000000)

In [60]:
# See how long it takes to get a new list containing the square of each element from big_list
%timeit big_list_squared = [i * i for i in big_list]

1 loop, best of 3: 1.46 s per loop


In [65]:
# See how long it takes to get a new array containing the square of each element from big_array
%timeit big_array_squared = big_array * big_array

10 loops, best of 3: 38 ms per loop


In [67]:
import math
%timeit big_list_sines = [math.sin(i) for i in big_list]

1 loop, best of 3: 3.37 s per loop


In [76]:
# numpy ufuncs provide vectorized operations that go beyond simple arithmetic.
# np.sin() calculates the sine of each element in the passed-in array.
%timeit big_array_sines = np.sin(big_array)

10 loops, best of 3: 206 ms per loop


In [74]:
# And not only is the list slower, it also uses more memory. Below is its size in MiB.
# (sys.getsizeof counts only the list's own pointer storage, not the int objects it refers to.)
import sys
sys.getsizeof(big_list) / 1024 / 1024

85.83079528808594

In [75]:
# vs. the size of the array in MiB
sys.getsizeof(big_array) / 1024 / 1024

38.147064208984375
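
For a closer look at where the memory goes, the sketch below uses the array's `nbytes` and `itemsize` attributes and, for the list, adds the size of the individual `int` objects to the size of the list itself. It uses one million elements rather than ten million to keep it quick, and the exact byte counts depend on the platform and Python implementation (CPython is assumed).

```python
import sys
import numpy as np

n = 1000000
big_list = list(range(n))
big_array = np.arange(n)

# An array stores its values in one contiguous buffer of itemsize bytes each.
assert big_array.nbytes == big_array.size * big_array.itemsize

# For the list, getsizeof reports only the table of pointers;
# every int it points to is a separate Python object with its own overhead.
list_pointers = sys.getsizeof(big_list)
list_elements = sum(sys.getsizeof(i) for i in big_list)

print(big_array.nbytes / 1024 / 1024)                    # array data in MiB
print((list_pointers + list_elements) / 1024 / 1024)     # list plus its ints in MiB
print(list_pointers + list_elements > big_array.nbytes)  # True
```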

In [77]:
# Note that the built-in `*` operator multiplies arrays element by element.
# For the dot product, use the `dot` method instead.
a1 = np.arange(0, 9).reshape(3,3)
a2 = np.arange(10, 19).reshape(3,3)
print("a1:", repr(a1))
print()
print("a2:", repr(a2))


a1: array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

a2: array([[10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])


In [78]:
# the * operator performs element-wise multiplication
a1 * a2

array([[  0,  11,  24],
       [ 39,  56,  75],
       [ 96, 119, 144]])

In [21]:
# to get the dot product of a matrix and a vector:
a1.dot(a2[0])

array([ 35, 134, 233])

In [22]:
# or the dot product of two matrices:
a1.dot(a2)

array([[ 45,  48,  51],
       [162, 174, 186],
       [279, 300, 321]])
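
On Python 3.5 and later, the `@` operator is another way to write the same product. The sketch below checks that it agrees with `dot` for these two matrices and spells out how one entry of the result is computed.

```python
import numpy as np

a1 = np.arange(0, 9).reshape(3, 3)
a2 = np.arange(10, 19).reshape(3, 3)

# For 2-D arrays, @, np.dot and the dot method all give the matrix product.
assert np.array_equal(a1 @ a2, a1.dot(a2))
assert np.array_equal(np.dot(a1, a2), a1.dot(a2))

# Each entry is a row of a1 multiplied element-wise with a column of a2, then summed.
top_left = sum(a1[0, k] * a2[k, 0] for k in range(3))
print(top_left)            # 45
print((a1 @ a2)[0, 0])     # 45
```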

## Matrices

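NumPy also provides a special `ndarray` subclass called `matrix`. It is essentially a two-dimensional array with a few extra utility functions and custom operators. Note that current NumPy documentation no longer recommends using this class, even for linear algebra; regular arrays are preferred.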

Notably, applying the built-in `*` operator to matrix objects results in the dot product.

In [80]:
m1 = np.matrix(a1)
m2 = np.matrix(a2)

In [81]:
m1 * m2

matrix([[ 45,  48,  51],
        [162, 174, 186],
        [279, 300, 321]])

## Accessing Array Elements and Slicing

Arrays can be sliced like so:
    
    array[lower:upper:step]

The `upper` bound is never included in the result.

In [23]:
a = np.array([10, 11, 12, 13, 14])
a[1:3]

array([11, 12])

In [24]:
# slicing a multidimensional array:
a = np.arange(9).reshape(3, 3)
a

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [25]:
# to get a scalar element value, specify its location in all dimensions (here, just row and column):
a[1, 1]

4

In [26]:
# retrieve an entire row by index:
a[0]

array([0, 1, 2])

In [27]:
# retrieve an entire column by index (note the slice syntax to get all rows,
# then the index to specify which column to extract from each row)
a[:, 0]

array([0, 3, 6])

In [28]:
# to select a sub-matrix, use slice syntax to specify row and column locations.
# here, we get the bottom-left corner of the matrix:
# (all rows starting with the second row, and all columns but the last one)
a[1:, :2]

array([[3, 4],
       [6, 7]])
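
The `step` part of the slice syntax and negative indices are not used in the cells above. Here is a short sketch of both, on a one-dimensional array and on a 3x3 matrix named `m` for the example.

```python
import numpy as np

a = np.array([10, 11, 12, 13, 14])
print(a[::2])     # [10 12 14]        every second element
print(a[::-1])    # [14 13 12 11 10]  a negative step walks backwards
print(a[-2:])     # [13 14]           negative indices count from the end

m = np.arange(9).reshape(3, 3)
# A step can be given separately for each dimension; this picks the four corners.
corners = m[::2, ::2]
print(corners)
# [[0 2]
#  [6 8]]
```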

#### Array slices are "views", not copies

That is, when you assign an array slice to a variable, and then change the data in the original array, the data in the variable (that is, the slice) changes as well.

In [29]:
# Get a slice of `a`, in this case a slice that covers the entire array.
a = np.arange(10)
b = a[:]
b

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [30]:
# Now let's change the last element from 9 to 90
a[-1] = 90

In [31]:
# and we see that the value changes in the view as well.
b

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8, 90])

In [32]:
# to get a copy of an array, use the `copy` method:
a = np.arange(10)
b = a.copy()
b

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [33]:
# because we used `copy` to create array `b`, changing a value in `a` will not affect `b`:
a[-1] = 90
b

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
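
The link between a view and its source works in both directions: writing through the slice changes the original array too. A minimal sketch, using `np.shares_memory` to check whether two arrays use the same underlying data (the names `view` and `copy` are illustrative):

```python
import numpy as np

a = np.arange(10)
view = a[2:5]          # a slice: shares data with a
copy = a[2:5].copy()   # an independent copy of the same elements

print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, copy))   # False

# Writing through the view changes the original array...
view[:] = 0
print(a)      # [0 1 0 0 0 5 6 7 8 9]
# ...but the copy keeps the old values.
print(copy)   # [2 3 4]
```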

#### "Fancy Indexing"

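You can use boolean comparisons to select just the elements you want from an array. (NumPy's documentation groups this boolean-mask selection, together with indexing by integer arrays, under "advanced indexing".)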

In [34]:
# boolean comparisons to arrays produce a "mask" array: an array of the same shape filled with boolean values
# indicating whether the item at that position in the original array meets the boolean condition.
a = np.random.randint(0, 10, 20)
print(a)
print(a > 5)

[2 5 7 2 1 8 7 7 3 3 6 8 5 4 1 8 9 4 2 4]
[False False  True False False  True  True  True False False  True  True
 False False False  True  True False False False]


In [35]:
# We can now use that "mask" array to select elements from an array with the same shape.
# for example, below will give us a new array with only the elements in `a` greater than 5.
# NB: in this case we get a copy of the original data, not a view like we would if we had sliced the array.
a[a > 5]

array([7, 8, 7, 7, 6, 8, 8, 9])

In [43]:
# It's possible to chain multiple boolean tests together with the & and | operators
# (although the conditions must be in parentheses)
a[(a <= 2) | (a >= 8)]

array([2, 2, 1, 8, 8, 1, 8, 9, 2])
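
Two related operations are worth a quick sketch: indexing with a list of integer positions, and assigning through a mask. The example uses a small fixed array instead of random values so that the results are predictable.

```python
import numpy as np

a = np.array([2, 5, 7, 2, 1, 8])

# Indexing with a list of positions picks out those elements, in that order.
picked = a[[0, 2, 5]]
print(picked)   # [2 7 8]

# Selecting with a mask returns a copy, so editing the result leaves `a` alone.
big = a[a > 5]
big[:] = -1
print(a)        # [2 5 7 2 1 8]

# Assigning through a mask, however, writes into `a` in place.
a[a > 5] = 0
print(a)        # [2 5 0 2 1 0]

# np.where with a single argument returns the positions where the condition holds.
zero_positions = np.where(a == 0)[0]
print(zero_positions)   # [2 5]
```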

## Transposition

`ndarray` objects provide easy access to their transposed axes. That is, we can turn rows into columns and vice-versa.

In [90]:
a = np.arange(18).reshape((2,3,3))
a

array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]]])

In [96]:
# the `T` property of an ndarray reverses the order of all axes (here, shape (2, 3, 3) becomes (3, 3, 2)):
a.T

array([[[ 0,  9],
        [ 3, 12],
        [ 6, 15]],

       [[ 1, 10],
        [ 4, 13],
        [ 7, 16]],

       [[ 2, 11],
        [ 5, 14],
        [ 8, 17]]])

In [97]:
# the `transpose` method lets us transpose axes in a particular order:
a.transpose((0, 2, 1))

array([[[ 0,  3,  6],
        [ 1,  4,  7],
        [ 2,  5,  8]],

       [[ 9, 12, 15],
        [10, 13, 16],
        [11, 14, 17]]])
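
Looking at shapes makes it easier to see what each transposition does. The sketch below also checks that a transposed array is a view on the same data rather than a copy; the two-dimensional array `b` is added only for the example.

```python
import numpy as np

a = np.arange(18).reshape((2, 3, 3))
b = np.arange(6).reshape(2, 3)

print(b.T.shape)                       # (3, 2): rows and columns swapped
print(a.T.shape)                       # (3, 3, 2): axis order reversed
print(a.transpose((0, 2, 1)).shape)    # (2, 3, 3): only the last two axes swapped

# With all axes reversed, element [i, j, k] of a.T is element [k, j, i] of a.
assert a.T[2, 1, 0] == a[0, 1, 2]

# A transpose is a view: writing to it changes the original array.
b.T[0, 1] = 99
print(b[1, 0])   # 99
```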

### Vector Stacking and Combining

Arrays can be combined in several ways.

"Stacking" joins arrays along a new axis, so each original array is kept intact as a row or column of the result.

"Concatenating" joins arrays along an existing axis, so the result has the same number of dimensions as the original arrays.

In [107]:
a1 = np.arange(3)
a2 = np.arange(3, 6)

In [108]:
# stack along a new first axis (the default, axis=0): each input array becomes a row.
# For 1-D inputs this gives the same result as `np.vstack`.
np.stack([a1, a2])

array([[0, 1, 2],
       [3, 4, 5]])

In [109]:
# stack along a new second axis: each input array becomes a column.
# For 1-D inputs this gives the same result as `np.column_stack`.
np.stack([a1, a2], axis=1)

array([[0, 3],
       [1, 4],
       [2, 5]])

In [112]:
np.concatenate([a1, a2])

array([0, 1, 2, 3, 4, 5])
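
For one-dimensional inputs like these, the helper functions `np.vstack`, `np.column_stack` and `np.hstack` correspond to the calls above as follows. This is a quick check, not an exhaustive description of those functions; they behave differently for higher-dimensional inputs.

```python
import numpy as np

a1 = np.arange(3)
a2 = np.arange(3, 6)

# Stacking on a new first axis: each input becomes a row.
assert np.array_equal(np.stack([a1, a2]), np.vstack([a1, a2]))
# Stacking on a new second axis: each input becomes a column.
assert np.array_equal(np.stack([a1, a2], axis=1), np.column_stack([a1, a2]))
# Concatenating 1-D arrays end to end.
assert np.array_equal(np.concatenate([a1, a2]), np.hstack([a1, a2]))

print(np.stack([a1, a2]).shape)           # (2, 3)
print(np.stack([a1, a2], axis=1).shape)   # (3, 2)
print(np.concatenate([a1, a2]).shape)     # (6,)
```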



### Loading (and saving) Data

There are several built-in functions in NumPy to load data.

- `fromfile`: fast and efficient loading of binary data (or simple text files)
- `ndarray.tofile`: save the contents of an array to a data file (NOT PORTABLE!)
- `save`: save an array in a portable format
- `load`: load data from the portable `save` format
- `loadtxt`: more flexible way of loading data from a text file
- `savetxt`: save data to a text file, with custom formatting
- `genfromtxt`: very flexible data loading, can handle missing/empty data
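
As a minimal sketch of the save/load pairs in the list above, the following writes a small array to a temporary directory in both the binary `.npy` format and as text, then reads it back. The file names, the two-column sample data and the `%.2f` format are choices made for this example only.

```python
import os
import tempfile
import numpy as np

data = np.array([[1.0, 70.77], [2.0, 78.64]])

with tempfile.TemporaryDirectory() as tmp:
    # Binary round trip: save/load preserve dtype and shape exactly.
    npy_path = os.path.join(tmp, "data.npy")
    np.save(npy_path, data)
    restored = np.load(npy_path)

    # Text round trip: savetxt writes the header as a comment line ("# ..."),
    # which loadtxt skips by default.
    txt_path = os.path.join(tmp, "data.csv")
    np.savetxt(txt_path, data, delimiter=",", fmt="%.2f", header="Month,Delta")
    from_text = np.loadtxt(txt_path, delimiter=",")

print(np.array_equal(data, restored))   # True
print(np.allclose(data, from_text))     # True
```

The text format is readable by other tools but only keeps as many digits as `fmt` asks for, whereas the `.npy` format stores the values exactly.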


In [37]:
with open('data.csv') as f:
    for row in f:
        print(row)

Month,Delta,Abs(Delta)

1,70.77,70.77

2,78.64,78.64

3,71.78,71.78

4,69.57,69.57

5,61.48,61.48

6,22.23,23.3

7,-40.74,40.74

8,-89.19,89.19

9,-106.9,106.9

10,-95.84,95.84

11,-52.83,52.83

12,16.65,22.26



In [38]:
data = np.loadtxt("data.csv", skiprows=1, delimiter=",")

In [39]:
data

array([[   1.  ,   70.77,   70.77],
       [   2.  ,   78.64,   78.64],
       [   3.  ,   71.78,   71.78],
       [   4.  ,   69.57,   69.57],
       [   5.  ,   61.48,   61.48],
       [   6.  ,   22.23,   23.3 ],
       [   7.  ,  -40.74,   40.74],
       [   8.  ,  -89.19,   89.19],
       [   9.  , -106.9 ,  106.9 ],
       [  10.  ,  -95.84,   95.84],
       [  11.  ,  -52.83,   52.83],
       [  12.  ,   16.65,   22.26]])

In [40]:
data.dtype

dtype('float64')

In [41]:
# By default, genfromtxt gives the same (two-dimensional) output.
np.genfromtxt("data.csv", skip_header=1, delimiter=",")

array([[   1.  ,   70.77,   70.77],
       [   2.  ,   78.64,   78.64],
       [   3.  ,   71.78,   71.78],
       [   4.  ,   69.57,   69.57],
       [   5.  ,   61.48,   61.48],
       [   6.  ,   22.23,   23.3 ],
       [   7.  ,  -40.74,   40.74],
       [   8.  ,  -89.19,   89.19],
       [   9.  , -106.9 ,  106.9 ],
       [  10.  ,  -95.84,   95.84],
       [  11.  ,  -52.83,   52.83],
       [  12.  ,   16.65,   22.26]])

In [42]:
# But if we specify `dtype=None` as an argument, it returns a one-dimensional "structured" array:
# each row becomes one record, and the data type of each field (column) is determined automatically.
arr = np.genfromtxt("data.csv", dtype=None, skip_header=1, delimiter=",")
arr

array([( 1,   70.77,   70.77), ( 2,   78.64,   78.64),
       ( 3,   71.78,   71.78), ( 4,   69.57,   69.57),
       ( 5,   61.48,   61.48), ( 6,   22.23,   23.3 ),
       ( 7,  -40.74,   40.74), ( 8,  -89.19,   89.19),
       ( 9, -106.9 ,  106.9 ), (10,  -95.84,   95.84),
       (11,  -52.83,   52.83), (12,   16.65,   22.26)], 
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
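
The fields of a structured array can be accessed by name. The sketch below assumes a small in-memory CSV (via `io.StringIO`) instead of `data.csv`, so that it runs on its own, and passes `names=True` so that `genfromtxt` takes the field names from the header row rather than generating `f0`, `f1`, and so on.

```python
import io
import numpy as np

# A three-row stand-in for the file above, with two of its columns.
csv_text = "Month,Delta\n1,70.77\n2,78.64\n3,71.78\n"

arr = np.genfromtxt(io.StringIO(csv_text), dtype=None, names=True, delimiter=",")

print(arr.dtype.names)      # ('Month', 'Delta')
print(arr["Month"])         # [1 2 3]   one whole column, selected by field name
print(arr["Delta"].max())   # 78.64
print(arr[0]["Month"])      # 1         one field of the first record
```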