# NumPy Basics: Arrays and Vectorized Computation

> NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Many computational packages providing scientific functionality use NumPy's array objects as the lingua franca for data exchange. NumPy provides:



* ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities.

* Mathematical functions for fast operations on entire arrays of data without having to write loops.

* Tools for reading/writing array data to disk and working with memory-mapped files.

* Linear algebra, random number generation, and Fourier transform capabilities.

* A C API for connecting NumPy with libraries written in C, C++, or FORTRAN.

## What will be covered?

For most data analysis applications, the main areas of functionality I’ll focus on are:

 * Fast array-based operations for data munging and cleaning, subsetting and filtering, transformation, and any other kinds of computations

 * Common array algorithms like sorting, unique, and set operations

 * Efficient descriptive statistics and aggregating/summarizing data

 * Data alignment and relational data manipulations for merging and joining together heterogeneous datasets

 * Expressing conditional logic as array expressions instead of loops with if-elif-else branches

 * Group-wise data manipulations (aggregation, transformation, function application)



> While NumPy provides a computational foundation for general numerical data processing, many readers will want to use pandas as the basis for most kinds of statistics or analytics, especially on tabular data. pandas also provides some more domain-specific functionality like time series manipulation, which is not present in NumPy.

### Why Is NumPy So Important?



One of the reasons NumPy is so important for numerical computations in Python is because it is designed for efficiency on large arrays of data. There are a number of reasons for this:

* NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy's library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.

* NumPy operations perform complex computations on entire arrays without the need for Python for loops, which can be slow for large sequences. NumPy is faster than regular Python code because its C-based algorithms avoid overhead present with regular interpreted Python code.
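The memory claim above can be checked roughly with a small sketch. The names `demo_arr` and `demo_list` and the size `n` are hypothetical, chosen only for illustration, and the list figure is an approximation because small Python integers are shared objects.

```python
import sys
import numpy as np

n = 1_000  # hypothetical size, chosen only for illustration

# Fix the dtype so the per-element size is the same on every platform.
demo_arr = np.arange(n, dtype=np.int64)
demo_list = list(range(n))

# An ndarray stores its elements in one contiguous buffer: 8 bytes per int64.
array_bytes = demo_arr.nbytes

# A list stores pointers to separate int objects; count both parts roughly.
list_bytes = sys.getsizeof(demo_list) + sum(sys.getsizeof(v) for v in demo_list)

print(array_bytes, list_bytes)
```

The exact list figure depends on the Python build, but it should come out several times larger than the array's buffer.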


## Let's compare the performance of NumPy and Python

In [None]:
import numpy as np
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
np.set_printoptions(precision=4, suppress=True)

Consider a NumPy array of one million integers, and the equivalent Python list:

In [None]:
list(range(5))

In [None]:
import numpy as np
my_arr = np.arange(1000000) # numpy array
my_list = list(range(1000000)) # python list

Now let's multiply each sequence by 2:

In [None]:
%time for _ in range(10): my_arr2 = my_arr * 2  # %timeit is an alternative that averages over repeated runs
%time for _ in range(10): my_list2 = [x * 2 for x in my_list]

> NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.

## The NumPy ndarray: A Multidimensional Array Object

- One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. 

- Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

Let us see how NumPy enables batch computations:

In [None]:
import numpy as np
# Generate some random data
data = np.random.randn(2, 3)
data

In [None]:
data * 10

In [None]:
data + data

In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each "cell" in the array have been added to each other.

> An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

In [None]:
data

In [None]:
data.shape

In [None]:
data.dtype

>  Whenever you see “array,” “NumPy array,” or “ndarray” in the book text, in most cases they all refer to the ndarray object.


### Creating ndarrays

> Using Array Creation: 

In [None]:
import numpy as np

In [None]:
x1 = np.array([[5, 10, 15],[3, 4, 5]])
x2 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
x3 = np.array([1, 2, 3])

In [None]:
x1

In [None]:
x1
x1.shape

In [None]:
x2.shape

In [None]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

In [None]:
arr2.ndim

In [None]:
arr2.shape

In [None]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

In [None]:
arr1.ndim

In [None]:
arr1.shape

In [None]:
arr1.dtype
arr2.dtype

In [None]:
np.array((1, 2, 3))

> Note: To create an ndarray, we can pass a list, tuple, or any array-like object to the np.array function, and it will be converted into an ndarray.
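As a minimal sketch of that note, the following converts a few array-like inputs. All variable names are hypothetical; `np.asarray` is included only to contrast it with the copying behavior of `np.array`.

```python
import numpy as np

# Each of these array-like inputs is converted to an ndarray.
from_list = np.array([1, 2, 3])
from_tuple = np.array((1, 2, 3))
from_nested = np.array([(1, 2), (3, 4)])   # list of tuples -> 2-D array
from_range = np.array(range(3))            # any sequence works

# np.asarray converts too, but returns its input unchanged when it is
# already an ndarray, whereas np.array makes a copy by default.
same_object = np.asarray(from_list)
new_copy = np.array(from_list)

print(from_nested.shape, same_object is from_list, new_copy is from_list)
```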

Changing Dimensions

In [None]:
p = np.array([1, 2, 3, 4, 5])
print(p.shape)

In [None]:
p = np.array([1, 2, 3, 4, 5], ndmin = 2)

print(p.shape)


In [None]:
p = np.array([[1, 2, 3, 4, 5]])

print(p.shape)


### Dimensions in Arrays
A dimension in arrays is one level of array depth (nested arrays).

- 0-D arrays, or scalars, are the elements in an array. Each value in an array is a 0-D array.
  
- An array that has 0-D arrays as its elements is called a uni-dimensional or 1-D array.
  
- An array that has 1-D arrays as its elements is called a 2-D array (often used to represent a matrix or 2nd-order tensor).
  
- An array that has 2-D arrays (matrices) as its elements is called a 3-D array (often used to represent a 3rd-order tensor).

In [None]:
import numpy as np

a = np.array(42)
b = np.array([1, 2, 3, 4, 5])
c = np.array([[1, 2, 3], [4, 5, 6]])


print(a.ndim)
print(b.ndim)
print(c.ndim)
# print(d.ndim)     
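The cell above leaves `d` undefined, so the 3-D case is not shown. A possible 3-D example, with hypothetical values, could look like this:

```python
import numpy as np

# Hypothetical 3-D array: two 2-D arrays (matrices), each of shape (2, 3).
d = np.array([[[1, 2, 3], [4, 5, 6]],
              [[7, 8, 9], [10, 11, 12]]])

print(d.ndim)   # number of dimensions (nesting depth)
print(d.shape)  # size along each dimension
```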

Other functions for creating an ndarray are:

In [None]:
# np.zeros(10)
# np.zeros((3, 6))
np.empty((2, 3, 2))

> It’s not safe to assume that np.empty will return an array of all zeros. This function returns uninitialized memory and thus may contain non-zero "garbage" values.
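When defined initial values matter, the other creation functions are the safer choice. The sketch below uses hypothetical variable names and shows `np.zeros`, `np.ones`, `np.full`, and `np.zeros_like` next to an `np.empty` buffer that is filled before it is read.

```python
import numpy as np

zeros = np.zeros((2, 3))            # filled with 0.0
ones = np.ones(4)                   # filled with 1.0
sevens = np.full((2, 2), 7)         # filled with a chosen constant
like = np.zeros_like(sevens)        # same shape and dtype as an existing array

# np.empty only guarantees shape and dtype; its values are arbitrary
# until they are assigned.
scratch = np.empty((2, 3))
scratch[:] = 0.0

print(zeros.shape, ones.sum(), like.dtype)
```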


arange is an array-valued version of the built-in Python range function:

In [None]:
np.arange(15)

<center>

![](./images/important_function_creationdy.png)

</center>

### Data Types for ndarrays

> The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [None]:
a = 3
type(a)

In [None]:
array = np.arange(10)
array.dtype

In [None]:
type(array)

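> Since NumPy is focused on numerical computing, creation functions such as np.zeros, np.ones, and np.empty default to float64 (floating point) when no dtype is specified. np.array and np.arange instead infer the dtype from their inputs, which is why the array above holds integers.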
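A short check of these defaults, as a sketch with hypothetical variable names. The integer dtype is platform-dependent, so only its kind is noted in the comments.

```python
import numpy as np

# Creation functions that take only a shape default to float64.
zeros_dtype = np.zeros(3).dtype
ones_dtype = np.ones(3).dtype

# np.array and np.arange infer the dtype from the values they are given.
ints = np.array([1, 2, 3])
floats = np.array([1, 2.5, 3])

print(zeros_dtype, ones_dtype)
print(ints.dtype)     # a platform-dependent integer type (often int64)
print(floats.dtype)   # one float in the input promotes the array to float64
```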

In [None]:
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)

In [None]:
arr1.dtype

In [None]:
arr2.dtype

In [None]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

You can explicitly convert or cast an array from one dtype to another using ndarray’s astype method:

In [None]:
float_arr = arr.astype(np.float64)
float_arr.dtype

In [None]:
a = 3.50
int(a)

In [None]:
float_arr

In [None]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr
arr.astype(np.int32)
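Casting floating-point numbers to an integer dtype drops the fractional part rather than rounding. The sketch below contrasts that with rounding first. The names `vals`, `truncated`, and `rounded` are hypothetical, and `np.rint` is shown only as one possible way to round.

```python
import numpy as np

vals = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = vals.astype(np.int32)          # drops the fractional part
rounded = np.rint(vals).astype(np.int32)   # rounds to the nearest integer first

print(truncated)  # values: 3, -1, -2, 0, 12, 10
print(rounded)    # values: 4, -1, -3, 0, 13, 10
```

Note that `np.rint` rounds halves to the nearest even integer, which is why 0.5 becomes 0.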

If you have an array of strings representing numbers, you can use astype to convert them to numeric form:

In [None]:
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)  # np.string_ was removed in NumPy 2.0
numeric_strings.astype(float) # note we use float here not float64

> NumPy aliases the Python types to its own equivalent dtypes, so float above is interpreted as np.float64.

You can also use another array’s dtype attribute:

In [None]:
int_array = np.arange(10)
int_array

In [None]:
int_array.dtype

In [None]:
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
int_array.astype(calibers.dtype)

There are shorthand type code strings you can also use to refer to a dtype:

In [None]:
empty_uint32 = np.empty(8, dtype='u4')
empty_uint32

<center>

![](./images/numpy_data_type.png)

</center>


### Arithmetic with NumPy Arrays

> Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operation between equal-size arrays applies the operation element-wise:

In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

In [None]:
arr * arr

In [None]:
arr - arr

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In [None]:
1 / arr

In [None]:
arr ** 0.5

Comparisons between arrays of the same size yield boolean arrays:

In [None]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2
arr2 > arr

### Basic Indexing 

There are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists:

In [None]:
arr = np.arange(10)
arr

In [None]:
arr[5]

In [None]:
arr

In [None]:
arr[5:8]

In [None]:
arr[5:8] = 12
arr

If you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcast henceforth) to the entire selection.

> An important first distinction from Python's built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.

In [None]:
a = [1,3,5,7,9]

b = a[1:3]

In [None]:
b

In [None]:
b[1] =100

In [None]:
b

In [None]:
a

In [None]:
arr

In [None]:
arr_slice = arr[5:8]
arr_slice

In [None]:
arr

In [None]:
arr_slice

In [None]:
arr_slice[0] = 12345
arr_slice

In [None]:
arr

> If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array—for example, arr[5:8].copy().


In [None]:
arr = np.arange(10)
arr_slice = arr[5:8].copy()
arr_slice[1] = 12345
arr
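One way to check whether two arrays share storage is `np.shares_memory`. The following sketch uses hypothetical names (`base`, `view`, `copied`) so that it does not overwrite the notebook's `arr`.

```python
import numpy as np

base = np.arange(10)

view = base[5:8]            # a slice: shares memory with base
copied = base[5:8].copy()   # an explicit copy: independent storage

print(np.shares_memory(base, view))     # True
print(np.shares_memory(base, copied))   # False

view[0] = -1      # visible through base
copied[1] = -2    # not visible through base
print(base)
```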

Slicing higher dimensions is also possible:

In [None]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[0]

Individual elements can be accessed recursively.

In [None]:
arr2d[0][0]

But that is a bit too much work, so you can instead pass a comma-separated list of indices to select individual elements. The following is equivalent to arr2d[0][0]:

In [None]:
arr2d[0, 0]

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions. So in the 2 × 2 × 3 array arr3d:

In [None]:
arr3d = np.array([
                  [[1, 2, 3], [4, 5, 6]], 
                  [[7, 8, 9], [10, 11, 12]]
                  
                 ])
arr3d

In [None]:
arr3d[0] #is a 2 × 3 array:

In [None]:
arr3d[1] #is a 2 × 3 array:

Both scalar values and arrays can be assigned to arr3d[0]:

In [None]:
old_values = arr3d[0].copy()
old_values

In [None]:
arr3d

In [None]:
arr3d[0] = 42
arr3d

In [None]:
arr3d[0] = old_values
arr3d

> Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0), forming a 1-dimensional array:

In [None]:
arr3d[1, 0]

In [None]:
arr3d[1, 1]

In [None]:
arr3d[0, 1]

Note that in all of these cases where subsections of the array have been selected, the returned arrays are views.

> The expression arr3d[1, 0] is the same as though we had indexed in two steps:

In [None]:
arr3d = np.array([
                  [[1, 2, 3], [4, 5, 6]], 
                  [[7, 8, 9], [10, 11, 12]]
                  
                 ])
arr3d

In [None]:
x = arr3d[1]
x
x[0]

In [None]:
x[0]

#### Indexing with slices

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:

In [None]:
arr = np.array([0,  1,  2,  3,  4, 64, 64, 64,  8,  9])
arr[1:6]

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different:

In [None]:
arr2d

In [None]:
arr2d[:2]

You can pass multiple slices just like you can pass multiple indexes:

In [None]:
arr2d[:2, 1:]

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices.



In [None]:
arr2d

For example, I can select the second row but only the first two columns like so:



In [None]:
arr2d[1, :2]

Similarly, we can select the third column but only the first two rows like so:



In [None]:
arr2d[:2, 2]

Note that a colon by itself means to take the entire axis, so you can slice only the higher dimensional axes by doing:

In [None]:
arr2d

In [None]:
arr2d[:, :1]
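The dimensionality rule for mixing integers and slices can be made concrete by inspecting shapes. This is a small sketch on a hypothetical 3 × 3 array named `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

both_slices = grid[:2, 1:]    # slice, slice -> still 2-D
int_and_slice = grid[1, :2]   # integer, slice -> 1-D
kept_2d = grid[1:2, :2]       # a length-1 slice keeps the axis

print(both_slices.shape, int_and_slice.shape, kept_2d.shape)
```

Using a length-1 slice instead of an integer is a common way to keep a result two-dimensional.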

Assigning to a slice expression assigns to the whole selection:



In [None]:
arr2d[:2, 1:] = 0
arr2d

### Boolean Indexing

Let’s consider an example where we have some data in an array and an array of names with duplicates. 


In [None]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
names


In [None]:
data = np.random.randn(7, 4)
data

> Suppose each name corresponds to a row in the data array and we wanted to select all the rows with corresponding name 'Bob'. Like arithmetic operations, comparisons (such as ==) with arrays are also vectorized. Thus, comparing names with the string 'Bob' yields a boolean array:

In [None]:
names == 'Bob'

This boolean array can be passed when indexing the array. The boolean array must be of the same length as the array axis it’s indexing:



In [None]:
data[names == 'Bob']

 You can even mix and match boolean arrays with slices or integers (or sequences of integers; more on this later).

> Below, we select from the rows where names == 'Bob' and index the columns, too:

In [None]:
data[names == 'Bob', 2:]


In [None]:
data[names == 'Bob', 3]

> To select everything but 'Bob', you can either use != or negate the condition using ~:



In [None]:
names != 'Bob'


In [None]:
~(names == 'Bob')

In [None]:
data[~(names == 'Bob')]

In [None]:
cond = names == 'Bob'
data[~cond]

To select two of the three names by combining multiple boolean conditions, use boolean arithmetic operators like & (and) and | (or):



In [None]:
mask = (names == 'Bob') | (names == 'Will')
mask


The Python keywords **and** and **or** do not work with boolean arrays. Use & (and) and | (or) instead.



In [None]:
data[mask]
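To see why the keywords fail, the sketch below applies both spellings to a small hypothetical array named `labels`. The keyword form needs a single truth value for a whole array, which NumPy declines to compute.

```python
import numpy as np

labels = np.array(['Bob', 'Joe', 'Will', 'Bob'])

# Element-wise OR: one boolean per element.
either = (labels == 'Bob') | (labels == 'Will')

# The keyword `or` asks for a single truth value of a whole array,
# which NumPy refuses to guess for arrays with more than one element.
try:
    (labels == 'Bob') or (labels == 'Will')
    keyword_failed = False
except ValueError:
    keyword_failed = True

print(either, keyword_failed)
```

The parentheses around each comparison matter, because & and | bind more tightly than == in Python.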

> Selecting data from an array by boolean indexing and assigning the result to a new variable always creates a copy of the data, even if the returned array is unchanged.
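A quick way to confirm the copy behavior, sketched with hypothetical names (`scores`, `keep_all`, `selected`):

```python
import numpy as np

scores = np.array([10, 20, 30, 40])
keep_all = np.array([True, True, True, True])

selected = scores[keep_all]   # every element selected, yet still a copy
selected[0] = -1              # does not touch scores

print(scores, np.shares_memory(scores, selected))
```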



Setting values with boolean arrays works by substituting the value or values on the right hand side into the locations where the boolean array's values are True. To set all of the negative values in data to 0 we need only do:



In [None]:
data[data < 0] = 0
data

You can also set whole rows or columns using a one-dimensional boolean array:



In [None]:
data[names != 'Joe'] = 7
data

> These types of operations on two-dimensional data are convenient to do with pandas.

### Fancy Indexing

In [None]:
x = np.array([11,3,5])

In [None]:
x[1]

In [None]:
mask = x > 3
mask

In [None]:
x

In [None]:
mask

In [None]:
x[mask]

Fancy indexing is a term adopted by NumPy to describe indexing using integer arrays. Suppose we had an 8 × 4 array:



In [None]:
arr = np.empty((8, 4))
for i in range(8):
    arr[i] = i
arr

To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:



In [None]:
arr[[4, 3, 0, 6]]

 Using negative indices selects rows from the end:



In [None]:
arr[[-3, -5, -7]]

Passing multiple index arrays does something slightly different; it selects a one-dimensional array of elements corresponding to each tuple of indices. The result of fancy indexing with multiple one-dimensional integer arrays is always one-dimensional, regardless of how many dimensions the source array has:



In [None]:
arr = np.arange(32).reshape((8, 4))
arr


In [None]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

The behavior of fancy indexing in this case is a bit different from what some users might have expected, which is the rectangular region formed by selecting a subset of the matrix’s rows and columns. Here is one way to get that:



In [None]:
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]

> Fancy indexing, unlike slicing, always copies the data into a new array.
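The difference between element-wise pairs and a rectangular block can be sketched as follows. The name `table` is hypothetical, and `np.ix_` is offered as an alternative to the chained indexing shown above, not as the approach the text prescribes.

```python
import numpy as np

table = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

pairs = table[rows, cols]              # elements (1,0), (5,3), (7,1), (2,2)
block_chained = table[rows][:, cols]   # chained indexing, as above
block_ix = table[np.ix_(rows, cols)]   # same rectangle in one indexing step

print(pairs)
print(block_ix)
```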

### Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the transpose method and also the special T attribute:



In [None]:
arr = np.arange(15).reshape((3, 5))
arr

In [None]:
arr.T

When doing matrix computations, you may do this very often—for example, when computing the inner matrix product using np.dot:



In [None]:
arr = np.random.randn(6, 3)
arr


In [None]:
arr.shape

In [None]:
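# Shapes for the matrix product arr x arr: (6, 3) x (6, 3)

Can we matrix-multiply arr by arr? No: the inner dimensions (3 and 6) do not match, so np.dot raises a ValueError:

In [None]:
np.dot(arr, arr)  # raises ValueError because the shapes are not aligned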

In [None]:
arr.T

In [None]:
arr.T.shape

In [None]:
np.dot(arr.T, arr)

The @ infix operator is another way to do matrix multiplication:



In [None]:
arr.T @ arr

For higher dimensional arrays, transpose will accept a tuple of axis numbers to permute the axes:



In [None]:
arr = np.arange(16).reshape((2, 2, 4))
arr


In [None]:
arr.transpose((1, 0, 2))

Here, the axes have been reordered with the second axis first, the first axis second, and the last axis unchanged. While it can be difficult to visualize a multidimensional transposition, it is a "reorientation" of the array which does not result in any data being copied or moved around.
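One way to reason about the permutation is to follow a single element. The sketch below uses the hypothetical names `cube` and `swapped` and also checks that no copy is made.

```python
import numpy as np

cube = np.arange(16).reshape((2, 2, 4))
swapped = cube.transpose((1, 0, 2))

# Element [i, j, k] of the result is element [j, i, k] of the original.
print(swapped[0, 1, 2], cube[1, 0, 2])

# No data is copied: both arrays refer to the same memory.
print(np.shares_memory(cube, swapped))
```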

>Simple transposing with .T is a special case of swapping axes. ndarray has the method swapaxes, which takes a pair of axis numbers and switches the indicated axes to rearrange the data:



In [None]:
arr

In [None]:
arr.shape

In [None]:
arr.swapaxes(1, 2)

In [None]:
arr.swapaxes(1, 2).shape

> swapaxes similarly returns a view on the data without making a copy.



## Universal Functions: Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.



In [None]:
arr = np.arange(10)
arr


In [None]:
np.sqrt(arr)

In [None]:
np.exp(arr)

> These are referred to as unary ufuncs. Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result:



In [None]:
x = np.random.randn(8)
y = np.random.randn(8)


In [None]:
x


In [None]:
y


In [None]:
np.maximum(x, y)

Here, numpy.maximum computed the element-wise maximum of the elements in x and y.



>While not common, a ufunc can return multiple arrays. modf is one example, a vectorized version of the built-in Python divmod; it returns the fractional and integral parts of a floating-point array:



In [None]:
arr = np.random.randn(7) * 5
arr


In [None]:
remainder, whole_part = np.modf(arr)


In [None]:
remainder


In [None]:
whole_part

Ufuncs accept an optional out argument that allows them to assign their results into an existing array rather than creating a new one:



In [None]:
arr


In [None]:
out = np.zeros_like(arr)
out

In [None]:
np.add(arr, 1)

In [None]:
np.add(arr, 1, out=out)
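The effect of the out argument is easier to see with fixed values. In this sketch the names `values` and `buffer` are hypothetical:

```python
import numpy as np

values = np.array([1.5, -2.0, 3.25])
buffer = np.zeros_like(values)

result = np.add(values, 1, out=buffer)  # writes into buffer

print(result is buffer)   # the ufunc returns the array passed as out
print(buffer)

# Passing the input itself as out updates it in place.
np.multiply(values, 2, out=values)
print(values)
```

Reusing a preallocated buffer this way can avoid temporary arrays in loops over large data.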

## Array-Oriented Programming with Arrays

- Vectorization describes operations on whole arrays of data at once rather than going value by value with a Python for loop.
  
- Vectorized array operations are usually significantly faster than their pure Python equivalents, with the biggest impact in numerical computations.

In [1]:
import numpy as np
points = np.arange(-5, 5, 0.01) # 1000 equally spaced points
xs, ys = np.meshgrid(points, points)
ys

array([[-5.  , -5.  , -5.  , ..., -5.  , -5.  , -5.  ],
       [-4.99, -4.99, -4.99, ..., -4.99, -4.99, -4.99],
       [-4.98, -4.98, -4.98, ..., -4.98, -4.98, -4.98],
       ...,
       [ 4.97,  4.97,  4.97, ...,  4.97,  4.97,  4.97],
       [ 4.98,  4.98,  4.98, ...,  4.98,  4.98,  4.98],
       [ 4.99,  4.99,  4.99, ...,  4.99,  4.99,  4.99]])

In [None]:
z = np.sqrt(xs ** 2 + ys ** 2)
z
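With 1000 points per axis the grids are too large to read, so here is a tiny sketch of what np.meshgrid returns. The names `px`, `py`, `gx`, and `gy` are hypothetical.

```python
import numpy as np

px = np.array([1, 2, 3])
py = np.array([10, 20])

gx, gy = np.meshgrid(px, py)

# gx repeats px along each row; gy repeats py down each column.
print(gx)
print(gy)

# Any element-wise expression now evaluates on every (x, y) pair at once.
total = gx + gy
print(total)
```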

In [None]:
import matplotlib.pyplot as plt
plt.imshow(z, cmap=plt.cm.gray); plt.colorbar()
plt.title(r"Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")  # raw string keeps \s from being read as an escape

In [None]:
plt.draw()

In [None]:
plt.close('all')

### Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if condition else y. Suppose we had a boolean array and two arrays of values:



In [2]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

Suppose we wanted to take a value from xarr whenever the corresponding value in cond is True, and otherwise take the value from yarr. A list comprehension doing this might look like:



In [3]:
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
result

[1.1, 2.2, 1.3, 1.4, 2.5]

- This has multiple problems. 

- First, it will not be very fast for large arrays (because all the work is being done in interpreted Python code). 

- Second, it will not work with multidimensional arrays. With np.where you can do this with a single function call:



In [4]:
result = np.where(cond, xarr, yarr)
result

array([1.1, 2.2, 1.3, 1.4, 2.5])

- The second and third arguments to np.where don’t need to be arrays; one or both of them can be scalars. 

- A typical use of where in data analysis is to produce a new array of values based on another array. 

- Suppose you had a matrix of randomly generated data and you wanted to replace all positive values with 2 and all negative values with –2. This is possible to do with np.where:



In [5]:
arr = np.random.randn(4, 4)
arr


array([[-0.77523069, -0.53038154, -0.27799303, -0.08194996],
       [-0.05980894, -0.80158757,  0.80434135, -0.42096952],
       [ 0.53540782,  0.44087774, -0.47456645,  1.12652094],
       [ 1.45232947,  0.31453353, -1.45150286, -1.51548967]])

In [6]:
np.where(arr > 0, 2, -2)

array([[-2, -2, -2, -2],
       [-2, -2,  2, -2],
       [ 2,  2, -2,  2],
       [ 2,  2, -2, -2]])

You can combine scalars and arrays when using np.where. For example, we can replace all positive values in arr with the constant 2 like so:



In [8]:
arr

array([[-0.77523069, -0.53038154, -0.27799303, -0.08194996],
       [-0.05980894, -0.80158757,  0.80434135, -0.42096952],
       [ 0.53540782,  0.44087774, -0.47456645,  1.12652094],
       [ 1.45232947,  0.31453353, -1.45150286, -1.51548967]])

In [7]:
np.where(arr > 0, 2, arr) # set only positive values to 2

array([[-0.77523069, -0.53038154, -0.27799303, -0.08194996],
       [-0.05980894, -0.80158757,  2.        , -0.42096952],
       [ 2.        ,  2.        , -0.47456645,  2.        ],
       [ 2.        ,  2.        , -1.45150286, -1.51548967]])

### Mathematical and Statistical Methods

- A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class. 

- You can use aggregations (sometimes called reductions) like sum, mean, and std (standard deviation) either by calling the array instance method or using the top-level NumPy function.



In [9]:
arr = np.random.randn(5, 4)
arr


array([[-0.7893682 , -0.91768748,  1.37472908, -0.92796446],
       [ 0.99982114,  1.34702052, -0.38194181,  0.04594524],
       [-0.00803804, -0.91757135, -1.99281639,  1.96969776],
       [ 0.12677296, -0.18232566, -0.33562987, -0.02825102],
       [ 0.6387175 , -0.37321944,  0.72955706,  2.03238459]])

In [13]:
arr.mean()

0.12049160750209817

In [14]:
np.mean(arr)

0.12049160750209817

In [15]:
arr.sum()

2.4098321500419635

Functions like mean and sum take an optional axis argument that computes the statistic over the given axis, resulting in an array with one less dimension:



In [16]:
arr.mean(axis=1) # "compute mean across the columns"


array([-0.31507276,  0.50271127, -0.237182  , -0.1048584 ,  0.75685993])

In [17]:
arr.mean(axis=0) # "compute mean down the rows"


array([ 0.19358107, -0.20875668, -0.12122038,  0.61836242])

In [18]:
arr.sum(axis=0) # "compute sum down the rows."

array([ 0.96790536, -1.0437834 , -0.60610192,  3.09181211])

In [None]:
arr.mean(1) # "compute mean across the columns"
arr.mean(0) # "compute mean down the rows"
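Because the random data above makes the axis semantics hard to verify by eye, here is a sketch on a small deterministic array. The name `small` is hypothetical, and `keepdims` is shown only as an optional extra.

```python
import numpy as np

small = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]

col_means = small.mean(axis=0)   # collapse axis 0 (rows): one value per column
row_means = small.mean(axis=1)   # collapse axis 1 (columns): one value per row

print(col_means)   # values: 1.5, 2.5, 3.5
print(row_means)   # values: 1.0, 4.0

# keepdims=True retains the collapsed axis with length 1.
print(small.mean(axis=1, keepdims=True).shape)
```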


Other methods like cumsum and cumprod do not aggregate, instead producing an array of the intermediate results:



In [19]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum()

array([ 0,  1,  3,  6, 10, 15, 21, 28])

In multidimensional arrays, accumulation functions like cumsum return an array of the same size, but with the partial aggregates computed along the indicated axis according to each lower dimensional slice:



In [20]:
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [21]:
arr.cumsum(axis=0)

array([[ 0,  1,  2],
       [ 3,  5,  7],
       [ 9, 12, 15]])

In [None]:
arr.cumprod(axis=1)

### Methods for Boolean Arrays

Boolean values are coerced to 1 (True) and 0 (False) in the preceding methods. Thus, sum is often used as a means of counting True values in a boolean array:



In [22]:
arr = np.random.randn(100)
arr

array([ 9.30760902e-01, -3.47948172e-01, -2.53108216e-01, -2.16568985e-01,
       -2.53384015e-01, -6.08845932e-01,  5.87268000e-01, -1.53587503e+00,
       -2.76224012e-01, -1.34927643e+00,  1.20083797e+00, -2.06762653e-01,
        5.24089726e-01,  5.69555194e-01,  8.11561554e-01,  2.24190183e-03,
       -1.75535805e-01, -1.89857170e+00, -2.94170358e+00, -2.26233058e+00,
       -1.59464300e+00, -2.32025375e-02,  1.38860693e+00,  3.03798053e-01,
       -1.29755170e+00, -1.07118325e+00,  8.12915224e-01, -2.79681776e-01,
       -5.58446144e-02, -4.75023410e-01,  7.84438903e-01, -1.72282199e+00,
       -2.20100184e+00, -5.64739340e-01,  5.17342054e-01, -2.79183067e-01,
       -3.92711127e-01, -1.58757221e+00, -2.52111408e+00, -2.02498938e+00,
        7.71804857e-01, -7.09180034e-01, -2.83470780e-01, -1.98889807e+00,
        2.89066611e-01, -6.92173981e-01,  1.33612387e-01, -6.39239730e-01,
        8.47418737e-01,  1.23592718e+00,  4.64094153e-01,  1.08134690e+00,
       -3.43133267e-01,  

In [23]:
(arr > 0).sum() # Number of positive values

50

There are two additional methods, any and all, useful especially for boolean arrays. any tests whether one or more values in an array is True, while all checks if every value is True:



In [24]:
bools = np.array([False, False, True, False])


In [28]:
~(bools.any())


False

In [27]:
~(bools.all())

True
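The two cells above negate the results with ~, which can obscure what any and all return on their own. A sketch without the negation, using the hypothetical names `flags` and `grid_flags`:

```python
import numpy as np

flags = np.array([False, False, True, False])

print(flags.any())   # True: at least one element is True
print(flags.all())   # False: not every element is True

# Both methods also accept an axis argument.
grid_flags = np.array([[True, False], [True, True]])
print(grid_flags.all(axis=1))   # one result per row
print(grid_flags.any(axis=0))   # one result per column
```

The ~ operator acts as logical negation here only because any and all return NumPy booleans. For a single plain Python bool, `not` is the safer spelling.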

### Sorting


Like Python’s built-in list type, NumPy arrays can be sorted in-place with the sort method:

In [29]:
arr = np.random.randn(6)
arr


array([ 1.84785979,  0.3773606 ,  1.9155765 ,  0.76673193, -0.07013321,
       -1.44843458])

In [34]:
arr.sort()


You can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to sort:



In [35]:
arr = np.random.randn(5, 3)
arr


array([[ 0.09835333,  0.24672353,  1.95284066],
       [-1.26505796, -1.33715706,  0.13081755],
       [-2.68703335, -1.36389307, -1.95272353],
       [-0.92220447,  0.70050284,  1.22472831],
       [ 1.48783064,  0.02674339,  0.48794935]])

In [36]:
arr.sort(1)
arr

array([[ 0.09835333,  0.24672353,  1.95284066],
       [-1.33715706, -1.26505796,  0.13081755],
       [-2.68703335, -1.95272353, -1.36389307],
       [-0.92220447,  0.70050284,  1.22472831],
       [ 0.02674339,  0.48794935,  1.48783064]])

In [37]:
arr.sort(0)
arr

array([[-2.68703335, -1.95272353, -1.36389307],
       [-1.33715706, -1.26505796,  0.13081755],
       [-0.92220447,  0.24672353,  1.22472831],
       [ 0.02674339,  0.48794935,  1.48783064],
       [ 0.09835333,  0.70050284,  1.95284066]])

The top-level function np.sort returns a sorted copy of an array instead of modifying the array in-place. For example:



In [39]:
arr2 = np.array([5, -10, 7, 1, 0, -3])
arr2

array([  5, -10,   7,   1,   0,  -3])

In [40]:
sorted_arr2 = np.sort(arr2)
sorted_arr2

array([-10,  -3,   0,   1,   5,   7])

In [None]:
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # 5% quantile
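The sort-and-index approach above can be compared with np.quantile, which is a different technique: by default it interpolates between neighboring sorted values. The following is a sketch under that assumption, with a seeded generator and hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)        # seeded generator, for reproducibility
sample = rng.standard_normal(1000)

# Approach from the text: sort, then index at the 5% position.
ordered = np.sort(sample)
by_index = ordered[int(0.05 * len(ordered))]

# np.quantile interpolates between neighboring sorted values by default,
# so its result is close to, but not identical with, the indexed value.
by_quantile = np.quantile(sample, 0.05)

print(by_index, by_quantile)
```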

### Unique and Other Set Logic

> NumPy has some basic set operations for one-dimensional ndarrays. A commonly used one is np.unique, which returns the sorted unique values in an array:



In [41]:
a = [1,33,4,4,1]
set(a)

{1, 4, 33}

In [42]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.unique(names)


array(['Bob', 'Joe', 'Will'], dtype='<U4')

In [43]:
ints = np.array([3, 3, 3, 2, 2, 1, 1, 4, 4])
np.unique(ints)

array([1, 2, 3, 4])

Contrast np.unique with the pure Python alternative:



In [None]:
sorted(set(names))

Another function, np.in1d, tests membership of the values in one array in another, returning a boolean array. Recent NumPy releases deprecate np.in1d in favor of np.isin, which returns the same result here:



In [45]:
values1 = np.array([6, 0, 0, 3, 2, 5, 6])

value2 = np.array([2, 3, 6])
np.in1d(values1, value2)

array([ True, False, False,  True,  True, False,  True])

In [46]:
np.isin(values1, value2)

array([ True, False, False,  True,  True, False,  True])


| Method | Description |
| --- | --- |
| unique(x) | Compute the sorted, unique elements in x |
| intersect1d(x, y) | Compute the sorted, common elements in x and y |
| union1d(x, y) | Compute the sorted union of elements |
| in1d(x, y) | Compute a boolean array indicating whether each element of x is contained in y |
| setdiff1d(x, y) | Set difference, elements in x that are not in y |
| setxor1d(x, y) | Set symmetric differences; elements that are in either of the arrays, but not both |
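The remaining set functions can be illustrated on two small arrays. This is a sketch with hypothetical inputs `left` and `right`:

```python
import numpy as np

left = np.array([1, 2, 3, 4])
right = np.array([3, 4, 5])

common = np.intersect1d(left, right)    # values: 3, 4
combined = np.union1d(left, right)      # values: 1, 2, 3, 4, 5
only_left = np.setdiff1d(left, right)   # values: 1, 2
exclusive = np.setxor1d(left, right)    # values: 1, 2, 5

print(common, combined, only_left, exclusive)
```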

## File Input and Output with Arrays

- NumPy is able to save and load data to and from disk in some text or binary formats.
- Most users will prefer pandas and other tools for loading text or tabular data.

> np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with file extension .npy:



In [47]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [48]:
np.save('some_array', arr)

If the file path does not already end in .npy, the extension will be appended. The array on disk can then be loaded with np.load:



In [49]:
np.load('some_array.npy')

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

You can save multiple arrays in an uncompressed archive using np.savez and passing the arrays as keyword arguments:



In [50]:
np.savez('array_archive.npz', a=arr, b=arr)

When loading an .npz file, you get back a dict-like object that loads the individual arrays lazily:



In [51]:
arch = np.load('array_archive.npz')


In [52]:
arch['b']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [53]:
arch['a']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

If your data compresses well, you may wish to use numpy.savez_compressed instead:



In [55]:
np.savez_compressed('arrays_compressed.npz', a=arr, b=arr)

In [56]:
!rm some_array.npy
!rm array_archive.npz
!rm arrays_compressed.npz

rm: some_array.npy: No such file or directory
rm: array_archive.npz: No such file or directory
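A round trip that cleans up after itself can be sketched with a temporary directory, which avoids the manual rm step. The file name `demo_archive.npz` and the variable names are hypothetical.

```python
import os
import tempfile
import numpy as np

first = np.arange(10)
second = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_archive.npz")   # hypothetical file name
    np.savez_compressed(path, first=first, second=second)

    # The archive is dict-like; closing it releases the file handle.
    with np.load(path) as archive:
        keys = sorted(archive.files)
        restored_first = archive["first"]
        restored_second = archive["second"]

print(keys)
```

The arrays are read while the archive is still open, so they remain usable after the temporary directory is deleted.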


## Linear Algebra

- Linear algebra operations, like matrix multiplication, decompositions, determinants, and other square matrix math, are an important part of many array libraries. 
 
- Unlike some languages like MATLAB, multiplying two two-dimensional arrays with * is an element-wise product instead of a matrix dot product.
  
- Thus, there is a function dot, both an array method and a function in the numpy namespace, for matrix multiplication:



In [57]:
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x
y

array([[ 6., 23.],
       [-1.,  7.],
       [ 8.,  9.]])

In [58]:
x.dot(y)

array([[ 28.,  64.],
       [ 67., 181.]])

> x.dot(y) is equivalent to np.dot(x, y):



In [59]:
np.dot(x, y)

array([[ 28.,  64.],
       [ 67., 181.]])
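The difference between * and dot is easiest to see on two square arrays, where both operations are defined. The following is a small sketch with made-up values:

```python
import numpy as np

a = np.array([[1., 2.], [3., 4.]])
b = np.array([[5., 6.], [7., 8.]])

elementwise = a * b        # multiplies matching entries
matrix_product = a.dot(b)  # rows of a combined with columns of b

print(elementwise)
# [[ 5. 12.]
#  [21. 32.]]
print(matrix_product)
# [[19. 22.]
#  [43. 50.]]
```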

A matrix product between a two-dimensional array and a suitably sized one-dimensional array results in a one-dimensional array:



In [60]:
x

array([[1., 2., 3.],
       [4., 5., 6.]])

In [61]:
x.shape

(2, 3)

In [62]:
np.ones(3)

array([1., 1., 1.])

In [63]:
np.ones(3).shape

(3,)

In [64]:
np.dot(x, np.ones(3))

array([ 6., 15.])

The @ symbol (as of Python 3.5) also works as an infix operator that performs matrix multiplication:



In [65]:
x @ np.ones(3)

array([ 6., 15.])
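A short sketch of how @ behaves with the shapes used above: it should agree with np.dot for the two-dimensional arrays, and a one-dimensional operand can appear on either side as long as the inner dimensions match. The variable names `row_sums` and `col_sums` are only illustrative.

```python
import numpy as np

x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1., 7.], [8., 9.]])

same = np.allclose(x @ y, np.dot(x, y))  # both compute the matrix product

row_sums = x @ np.ones(3)  # shapes (2, 3) @ (3,) -> (2,)
col_sums = np.ones(2) @ x  # shapes (2,) @ (2, 3) -> (3,)

print(same)      # True
print(row_sums)  # [ 6. 15.]
print(col_sums)  # [5. 7. 9.]
```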

> numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant.


These are implemented under the hood via the same battle-tested optimized linear algebra libraries used in other languages like MATLAB and R, such as BLAS, LAPACK, or possibly (depending on your NumPy build) the proprietary Intel MKL (Math Kernel Library):


In [66]:
from numpy.linalg import inv, qr
X = np.random.randn(5, 5)
X

array([[ 0.97722907, -0.03482332, -0.19482542, -0.17912341,  2.30819686],
       [-1.67740133,  0.29133724, -0.21880788, -0.62624286, -0.34873465],
       [ 1.69210321,  0.2667365 ,  1.18642531, -0.97655539, -1.46112024],
       [-0.28962418,  0.58813763,  0.57687767,  0.48000825, -1.50240726],
       [-0.30544613,  1.31421328,  2.55707471, -1.77361718, -0.91674592]])

In [69]:
mat = X.T.dot(X)  # matrix product of X's transpose (X.T) with X

mat

array([[ 6.80904464, -0.6431344 ,  1.23606734, -0.37429399,  1.08338876],
       [-0.6431344 ,  2.23030085,  3.95932555, -2.48529351, -2.66013451],
       [ 1.23606734,  3.95932555,  8.36485779, -5.24505097, -5.31779258],
       [-0.37429399, -2.48529351, -5.24505097,  4.75405158,  2.13659378],
       [ 1.08338876, -2.66013451, -5.31779258,  2.13659378, 10.68191163]])

In [70]:
inv(mat)


array([[ 0.54710133,  1.82471882, -1.20645936, -0.26747332, -0.14818927],
       [ 1.82471882,  8.9195757 , -5.2871167 , -0.83371943, -0.42914406],
       [-1.20645936, -5.2871167 ,  3.82011194,  1.14024512,  0.47940344],
       [-0.26747332, -0.83371943,  1.14024512,  0.92017189,  0.20310271],
       [-0.14818927, -0.42914406,  0.47940344,  0.20310271,  0.19981309]])

In [71]:
q, r = qr(mat)
r

array([[ -7.04403297,   0.40761456,  -1.76199361,   0.97928192,
         -1.88634346],
       [  0.        ,  -5.84388333, -11.81653876,   7.52361105,
         10.37683622],
       [  0.        ,   0.        ,  -0.47878498,  -0.32265109,
          4.99586928],
       [  0.        ,   0.        ,   0.        ,  -1.82361149,
          4.1313129 ],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          1.39059555]])

In [72]:
q

array([[-0.96664009,  0.04262881, -0.07639906, -0.12444883, -0.20607134],
       [ 0.09130201, -0.37527868,  0.65644488, -0.25254806, -0.59676582],
       [-0.17547722, -0.68975583,  0.1981234 , -0.09879848,  0.66665629],
       [ 0.05313632,  0.42898746,  0.17184792, -0.83895517,  0.28243373],
       [-0.15380234,  0.44447199,  0.70317758,  0.45510993,  0.27785919]])
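One way to sanity-check results like these is to test the defining properties numerically. The sketch below builds its own matrix with a seeded generator (the seed 0 and the tolerance are arbitrary assumptions), so its numbers will differ from the output above.

```python
import numpy as np
from numpy.linalg import inv, qr

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
X = rng.standard_normal((5, 5))
mat = X.T @ X

# A matrix times its inverse should be (numerically) the identity.
identity_ok = np.allclose(mat @ inv(mat), np.eye(5), atol=1e-6)

q, r = qr(mat)
reconstructs = np.allclose(q @ r, mat)                     # q @ r recovers mat
q_orthogonal = np.allclose(q.T @ q, np.eye(5), atol=1e-6)  # columns of q are orthonormal
r_upper = np.allclose(r, np.triu(r))                       # r is upper triangular

print(identity_ok, reconstructs, q_orthogonal, r_upper)
```

Comparisons use np.allclose rather than exact equality because floating-point round-off leaves tiny residuals.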

## Pseudorandom Number Generation

- The numpy.random module supplements the built-in Python random module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.




#### random.random

> Return random floats in the half-open interval [0.0, 1.0). Alias for random_sample to ease forward-porting to the new random API.


In [73]:
import numpy as np
np.random.random(4)

array([0.43017196, 0.67164888, 0.48398242, 0.23456037])

In [None]:
np.random.random(3) #Return random floats in the half-open interval [0.0, 1.0). 

In [74]:
np.random.normal(3)  # one sample from a normal (Gaussian) distribution with mean 3 and standard deviation 1

2.95484663847779

In [None]:
#Draw samples from the distribution:
mu, sigma = 0, 0.1 # mean and standard deviation
s = np.random.normal(mu, sigma, 1000)
s

In [None]:
n, p = 10, .5  # number of trials, probability of each trial
s = np.random.binomial(n, p, 1000)
# result of flipping a coin 10 times, tested 1000 times.
s

In [None]:
samples = np.random.normal(size=(4, 4))
samples

- We say that these are pseudorandom numbers because they are generated by an algorithm with deterministic behavior based on the seed of the random number generator. 

- You can change NumPy's random number generation seed using np.random.seed:



In [77]:
np.random.randn(10)

array([ 0.11563019, -1.85177551, -0.65171651, -1.65323019, -1.2658211 ,
        1.09356434, -0.0210074 ,  0.6079142 ,  1.80136176,  0.1925204 ])

In [85]:
np.random.seed(1234)
np.random.randn(10)

array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 , -0.72058873,
        0.88716294,  0.85958841, -0.6365235 ,  0.01569637, -2.24268495])
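To see the deterministic behavior directly, the following sketch resets the global seed and draws again. Re-seeding with the same value should replay the same numbers, while continuing to draw without re-seeding moves further along the stream.

```python
import numpy as np

np.random.seed(1234)
first = np.random.randn(5)

np.random.seed(1234)         # reset the generator to the same seed
second = np.random.randn(5)  # replays the same five values

third = np.random.randn(5)   # continues the stream, so the values differ

print(np.array_equal(first, second))  # True
print(np.array_equal(second, third))  # False
```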

> The data generation functions in numpy.random use a global random seed. To avoid global state, you can use numpy.random.RandomState to create a random number generator isolated from others:

In [86]:
rng = np.random.RandomState(1234)
rng.randn(10)

array([ 0.47143516, -1.19097569,  1.43270697, -0.3126519 , -0.72058873,
        0.88716294,  0.85958841, -0.6365235 ,  0.01569637, -2.24268495])
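A brief sketch of what "isolated" means here: two RandomState objects created with the same seed produce the same values, and re-seeding or drawing from the global generator in between does not disturb them. The names `rng_a` and `rng_b` are illustrative.

```python
import numpy as np

rng_a = np.random.RandomState(1234)
rng_b = np.random.RandomState(1234)

a_first = rng_a.randn(3)

# Re-seed and consume the global generator; rng_b keeps its own state.
np.random.seed(0)
np.random.randn(100)

b_first = rng_b.randn(3)

print(np.array_equal(a_first, b_first))  # True
```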

#### numpy.random.Generator.random

> Return random floats in the half-open interval [0.0, 1.0).



In [87]:
rng = np.random.default_rng()
rng.random()

0.530078415536801

In [None]:
type(rng.random())

In [None]:
rng.random((5,))

In [None]:
rng.random((3, 2)) # 3 * 2 array of random numbers

In [None]:
# Do this (new version)

from numpy.random import default_rng
rng = default_rng()
vals = rng.standard_normal(10)
more_vals = rng.standard_normal(10)


# instead of this (legacy version)
from numpy import random
vals = random.standard_normal(10)
more_vals = random.standard_normal(10)
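The generator returned by default_rng can also be given a seed when results need to be repeatable. The sketch below assumes an arbitrary seed of 12345; note that Generator.integers excludes the upper bound by default.

```python
import numpy as np

rng1 = np.random.default_rng(12345)  # arbitrary seed
rng2 = np.random.default_rng(12345)  # same seed, independent object

vals1 = rng1.standard_normal(10)
vals2 = rng2.standard_normal(10)

# Integers from 0 (inclusive) up to 2 (exclusive), i.e. 0 or 1.
coin_flips = rng1.integers(0, 2, size=8)

print(np.array_equal(vals1, vals2))  # True
print(coin_flips)
```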

## Example: Random Walks

In [None]:
import random
position = 0
walk = [position]
steps = 1000
for i in range(steps):
    step = 1 if random.randint(0, 1) else -1
    position += step
    walk.append(position)

In [None]:
import matplotlib.pyplot as plt  # plotting library used for the walk below
plt.figure()

In [None]:
plt.plot(walk[:100])

In [None]:
np.random.seed(12345)

In [None]:
nsteps = 1000
draws = np.random.randint(0, 2, size=nsteps)
steps = np.where(draws > 0, 1, -1)
walk = steps.cumsum()

In [None]:
walk.min()
walk.max()

In [None]:
(np.abs(walk) >= 10).argmax()
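The argmax call above relies on the fact that, for a boolean array, argmax returns the index of the first True value. A tiny hand-written walk (the steps here are made up, not random) shows the idea, along with one caveat: if the threshold is never reached, argmax still returns 0, so it can be worth checking any() first.

```python
import numpy as np

steps = np.array([1, 1, -1, 1, 1, 1, -1, 1])
walk = steps.cumsum()          # [1, 2, 1, 2, 3, 4, 3, 4]

reached = np.abs(walk) >= 3    # first True is at index 4
first_hit = reached.argmax()

never = np.abs(walk) >= 10     # all False for this walk
misleading = never.argmax()    # 0, even though the threshold is never hit
was_reached = never.any()      # False tells us to ignore that 0

print(first_hit)    # 4
print(misleading)   # 0
print(was_reached)  # False
```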

### Simulating Many Random Walks at Once

In [None]:
nwalks = 5000
nsteps = 1000
draws = np.random.randint(0, 2, size=(nwalks, nsteps)) # 0 or 1
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(1)
walks

In [None]:
walks.max()
walks.min()

In [None]:
hits30 = (np.abs(walks) >= 30).any(1)
hits30
hits30.sum() # Number that hit 30 or -30

In [None]:
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(1)
crossing_times.mean()
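For comparison, here is a sketch of the same simulation written against the newer Generator interface instead of the legacy np.random functions used in the cells above. The seed and the smaller sizes (200 walks of 500 steps) are assumptions made to keep the example quick, so the counts will not match the cells above.

```python
import numpy as np

rng = np.random.default_rng(12345)  # arbitrary seed
nwalks, nsteps = 200, 500           # smaller than above, to run quickly

draws = rng.integers(0, 2, size=(nwalks, nsteps))  # 0 or 1 (upper bound excluded)
steps = np.where(draws > 0, 1, -1)
walks = steps.cumsum(axis=1)        # one walk per row

# Which walks ever reach 30 or -30, and at which step do they first do so?
hits30 = (np.abs(walks) >= 30).any(axis=1)
crossing_times = (np.abs(walks[hits30]) >= 30).argmax(axis=1)

print(hits30.sum(), "of", nwalks, "walks reached 30 or -30")
if hits30.any():
    print("mean first crossing index:", crossing_times.mean())
```

Because each step is +1 or -1, a walk needs at least 30 steps to reach the threshold, so every crossing index is 29 or larger.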

In [None]:
steps = np.random.normal(loc=0, scale=0.25,
                         size=(nwalks, nsteps))

## Conclusion