# Introduction to NumPy

## Meet NumPy

### Welcome

NumPy: Numerical Python

Why NumPy?

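* It's typically faster than looping over plain Python lists.
* It is built on the array programming paradigm: operations apply to whole arrays, so you don't have to loop over the data yourself.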
* Additional concepts:
    * Linear algebra
    * Matrix multiplication
    * Fourier transforms
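
To make the "no loops" point concrete, here is a minimal sketch that converts minutes to hours once with a plain Python loop and once with a single array expression. The `minutes` values are made up for illustration.

```python
import numpy as np

# Hypothetical data: study minutes for five days.
minutes = [150, 60, 80, 60, 30]

# Plain Python: visit every element explicitly.
hours_loop = [m / 60 for m in minutes]

# NumPy: one expression applies the division to the whole array.
minutes_arr = np.array(minutes)
hours_vec = minutes_arr / 60

print(hours_loop)
print(hours_vec)
```

Both versions compute the same values; the array version simply leaves the looping to NumPy.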

## Getting Set Up

Recommended: use Anaconda, a Python distribution that bundles many popular data libraries.

Create a new environment:

    conda create -n 100days numpy jupyter
    
Activate environment:

    conda activate 100days
    
where "100days" is the name of the environment.

Open up Jupyter server:

    jupyter notebook

In [1]:
# Import as np
import numpy as np

In [2]:
np.__version__

'1.16.2'

### Introducing Arrays

Arrays:
* More restrictive than lists: every item must be the same data type.
* An array's length can't be changed; the size must be defined when you create the array.
* Less overhead than lists.
* Elements are stored contiguously in memory.

#### Lists

In [3]:
gpas_as_list = [4.0, 3.286, 3.5]

In [4]:
# Can append
gpas_as_list.append(4.0)

In [5]:
# Can insert different data types
gpas_as_list.insert(1, "Whatevs")

In [6]:
# Can remove
gpas_as_list.pop(1)

'Whatevs'

In [7]:
gpas_as_list

[4.0, 3.286, 3.5, 4.0]

#### Arrays

In [8]:
gpas = np.array(gpas_as_list)
gpas

array([4.   , 3.286, 3.5  , 4.   ])

In [9]:
# Peep at documentation:
?gpas

In [10]:
# Describe format of elements in array:
gpas.dtype

dtype('float64')

In [11]:
# Bytes per element (64 bits = 8 bytes):
gpas.itemsize

8

In [12]:
# No. of elements:
gpas.size

4

In [13]:
len(gpas)

4

In [14]:
# Total number of bytes (8 bytes * 4 elements):
gpas.nbytes

32

#### Differences between lists and NumPy Arrays
* An array's size is immutable.  You cannot append, insert or remove elements, like you can with a list.
* All of an array's elements must be of the same [data type](https://docs.scipy.org/doc/numpy-1.14.0/user/basics.types.html).
* A NumPy array behaves in a Pythonic fashion.  You can call `len(my_array)` just as you would expect.
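
The differences above can be checked with a short sketch. It assumes nothing beyond NumPy itself; the variable names `longer` and `mixed` are made up for illustration.

```python
import numpy as np

gpas = np.array([4.0, 3.286, 3.5])

# np.append does not modify gpas; it returns a new, longer array.
longer = np.append(gpas, 4.0)
print(gpas.size, longer.size)   # 3 4

# Mixing ints and floats does not raise an error;
# NumPy converts everything to one common dtype.
mixed = np.array([1, 2.5])
print(mixed.dtype)              # float64

# len() works on arrays just as it does on lists.
print(len(longer))              # 4
```

Because the original array is left untouched, "appending" to an array always means building a new one.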

#### Quiz
* You create a new array by passing an iterable to `np.array`.  An array is Pythonic, so you can pass it to the `len` function.
* Create an array with elements 1, 2, 3: `np.array([1, 2, 3])`
* A list can have items appended, inserted, and removed; an array cannot.

### Creating the Study Log

Create an array populated with zeros, e.g., 100 elements:

In [15]:
# Default is floating point:
study_min = np.zeros(100)
study_min

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [16]:
# Jupyter Notebook special command
%whos

Variable       Type       Data/Info
-----------------------------------
gpas           ndarray    4: 4 elems, type `float64`, 32 bytes
gpas_as_list   list       n=4
np             module     <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
study_min      ndarray    100: 100 elems, type `float64`, 800 bytes


In [17]:
minutes_in_day = 24 * 60
minutes_in_day

1440

In [18]:
# Unsigned integers take less space:
study_minutes = np.zeros(100, np.uint16)
study_minutes

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint16)

In [19]:
%whos

Variable         Type       Data/Info
-------------------------------------
gpas             ndarray    4: 4 elems, type `float64`, 32 bytes
gpas_as_list     list       n=4
minutes_in_day   int        1440
np               module     <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
study_min        ndarray    100: 100 elems, type `float64`, 800 bytes
study_minutes    ndarray    100: 100 elems, type `uint16`, 200 bytes


In [20]:
study_minutes[0] = 150
first_day_minutes = study_minutes[0]
first_day_minutes

150

In [21]:
type(first_day_minutes)

numpy.uint16

Terminology:
* Scalar: a single value; one element
* Vector: multiple elements

In [22]:
study_minutes[1] = 60
second_day_minutes = study_minutes[1]
second_day_minutes

60

In [23]:
# Assign 4 elements: the slice starts at index 2 and stops before index 6 (2 + 4):
study_minutes[2:6] = [80, 60, 30, 90]
study_minutes

array([150,  60,  80,  60,  30,  90,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0], dtype=uint16)

ndarray: n-dimensional array

#### Quiz
* Vector: single dimension array
* Scalar: one of the elements in the array

### Multidimensional Arrays

Rank: number of dimensions an array has

Matrix: array with 2 dimensions (vector: 1)

In [24]:
students_gpas = np.array([
    [4.0, 3.286, 3.5, 4.0],
    [3.2, 3.8, 4.0, 4.0],
    [3.96, 3.92, 4.0, 4.0],
], np.float16)
students_gpas

array([[4.   , 3.285, 3.5  , 4.   ],
       [3.2  , 3.8  , 4.   , 4.   ],
       [3.96 , 3.92 , 4.   , 4.   ]], dtype=float16)

#### About data types

* By choosing the proper [data type](https://docs.scipy.org/doc/numpy-1.14.0/user/basics.types.html) you can greatly reduce the size required to store objects
* Data types are maintained by wrapping values in a [scalar representation](https://docs.scipy.org/doc/numpy-1.14.0/reference/arrays.scalars.html)
* `np.zeros` is a handy way to create a new array filled with zeros.
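
As a rough illustration of these points, the sketch below compares the memory used by two dtypes, shows the range of `uint16`, and shows the rounding that a small float type introduces. The variable names are made up for illustration.

```python
import numpy as np

as_float64 = np.zeros(100)              # default dtype is float64
as_uint16 = np.zeros(100, np.uint16)    # 2 bytes per element

print(as_float64.nbytes)   # 800
print(as_uint16.nbytes)    # 200

# uint16 covers 0..65535, which is plenty for the 1440 minutes in a day.
uint16_max = np.iinfo(np.uint16).max
print(uint16_max)          # 65535

# Smaller float types save space but store fewer digits.
stored = float(np.float16(3.286))
print(stored)              # 3.28515625

# Indexing returns a NumPy scalar type, not a plain Python int.
first = as_uint16[0]
print(type(first).__name__)   # uint16
```

The `float16` rounding is why 3.286 is displayed as 3.285 in the `students_gpas` output.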

In [25]:
# Get the number of dimensions:
students_gpas.ndim

2

In [26]:
# Get the number of elements in each dimension, i.e., length of axes
students_gpas.shape

(3, 4)

In [27]:
# Total number of elements
students_gpas.size

12

In [52]:
# How many bytes used for each item:
students_gpas.itemsize

2

In [53]:
# Total size
total_size = students_gpas.size * students_gpas.itemsize
total_size

24

In [30]:
%whos ndarray

Variable        Type       Data/Info
------------------------------------
gpas            ndarray    4: 4 elems, type `float64`, 32 bytes
students_gpas   ndarray    3x4: 12 elems, type `float16`, 24 bytes
study_min       ndarray    100: 100 elems, type `float64`, 800 bytes
study_minutes   ndarray    100: 100 elems, type `uint16`, 200 bytes


In [31]:
# NumPy function: get information on any NumPy object
np.info(students_gpas)

class:  ndarray
shape:  (3, 4)
strides:  (8, 2)
itemsize:  2
aligned:  True
contiguous:  True
fortran:  False
data pointer: 0x25bfd6a45d0
byteorder:  little
byteswap:  False
type: float16


In [32]:
students_gpas[2]

array([3.96, 3.92, 4.  , 4.  ], dtype=float16)

In [33]:
students_gpas[2][3]

4.0

#### Multidimensional Arrays

* The data structure is actually called `ndarray`, representing any **n**umber of **d**imensions
* Arrays can have multiple dimensions; you declare them on creation.
* Dimensions help define what each element in the array represents.  A two-dimensional array can be thought of as an array of arrays.
* **Rank** defines how many dimensions an array contains.
* **Shape** defines the length of each of the array's dimensions.
* Each dimension is also referred to as an **axis**, and axes are zero-indexed.  The plural is **axes**.
* A 2D array is also known as a **matrix**.
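
A small sketch of rank, shape, and axes, reusing the GPA values from earlier. Reading the rows as students and the columns as terms is an assumption made for illustration, as is the use of `mean` to show what an axis does.

```python
import numpy as np

students_gpas = np.array([
    [4.0, 3.286, 3.5, 4.0],
    [3.2, 3.8, 4.0, 4.0],
    [3.96, 3.92, 4.0, 4.0],
])

print(students_gpas.ndim)    # 2
print(students_gpas.shape)   # (3, 4)

# Axis 0 runs down the rows; axis 1 runs across the columns.
# Averaging along axis 1 collapses the columns: one value per row.
per_student = students_gpas.mean(axis=1)
# Averaging along axis 0 collapses the rows: one value per column.
per_term = students_gpas.mean(axis=0)

print(per_student.shape)     # (3,)
print(per_term.shape)        # (4,)
```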

## Array Organization

### Indexing

In [34]:
study_minutes

array([150,  60,  80,  60,  30,  90,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0], dtype=uint16)

In [35]:
study_minutes = np.array([
    study_minutes,
    np.zeros(100, np.uint16)
])

In [36]:
study_minutes.shape

(2, 100)

In [37]:
# Set round 2 day 1 to 60
study_minutes[1][0] = 60
study_minutes

array([[150,  60,  80,  60,  30,  90,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0],
       [ 60,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0]], dtype=uint16)

In [38]:
# NumPy shortcut: index with a tuple instead of chained brackets
study_minutes[1, 0]

60

#### Fancy Indexing

With fancy indexing, the resulting array's shape matches the layout of the index array.  Be careful to distinguish between the tuple shortcut (one index per dimension) and fancy indexing (a list or array of indices).

RandomState: seed randomness in a way that's repeatable.

In [39]:
# Random number generator
rand = np.random.RandomState(42)
fake_log = rand.randint(30, 180, size=100, dtype=np.uint16)
fake_log

array([132, 122, 128,  44, 136, 129, 101,  95,  50, 132, 151,  64, 104,
       175, 117, 146, 139, 129, 133, 176,  98, 160, 179,  99,  82, 142,
        31, 106, 117,  56,  98,  67, 121, 159,  81, 170,  31,  50,  49,
        87, 179,  51, 116, 177, 118,  78, 171, 117,  88, 123, 102,  44,
        79,  31, 108,  80,  59, 137,  84,  93, 155, 160,  67,  80, 166,
       164,  70,  50, 102, 113,  47, 131, 161, 118,  82,  89,  81,  43,
        81,  38, 119,  52,  82,  31, 159,  57, 113,  71, 121, 140,  91,
        70,  37, 106,  64, 127, 110,  58,  93,  79], dtype=uint16)

In [40]:
[fake_log[3], fake_log[8]]

[44, 50]

In [41]:
[fake_log[[3, 8]]]

[array([44, 50], dtype=uint16)]

In [42]:
index = np.array([
    [3, 8],
    [0, 1]
])
fake_log[index]

array([[ 44,  50],
       [132, 122]], dtype=uint16)

In [43]:
# Error: all input arrays must have same number of dimensions
# study_minutes = np.append(study_minutes, fake_log, axis=0)

# Solve by wrapping the row in a list so it becomes 2-dimensional:
study_minutes = np.append(study_minutes, [fake_log], axis=0)

In [44]:
study_minutes[1, 1] = 360
study_minutes

array([[150,  60,  80,  60,  30,  90,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0],
       [ 60, 360,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
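          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0],
       [132, 122, 128,  44, 136, 129, 101,  95,  50, 132, 151,  64, 104,
        175, 117, 146, 139, 129, 133, 176,  98, 160, 179,  99,  82, 142,
         31, 106, 117,  56,  98,  67, 121, 159,  81, 170,  31,  50,  49,
         87, 179,  51, 116, 177, 118,  78, 171, 117,  88, 123, 102,  44,
         79,  31, 108,  80,  59, 137,  84,  93, 155, 160,  67,  80, 166,
        164,  70,  50, 102, 113,  47, 131, 161, 118,  82,  89,  81,  43,
         81,  38, 119,  52,  82,  31, 159,  57, 113,  71, 121, 140,  91,
         70,  37, 106,  64, 127, 110,  58,  93,  79]], dtype=uint16)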

#### Code Challenge

In [46]:
quarterly_revenue = np.array([22.72, 29.13, 25.36, 35.75])
quarterly_revenue[[1, 3]]

array([29.13, 35.75])

In [47]:
quarterly_revenue_by_year = np.array([
    [22.72, 29.13, 25.36, 35.75],
    [29.13, 30.4, 32.71, 43.74],
    [35.71, 37.96, 43.74, 60.5]
])
quarterly_revenue_by_year[1, 3]

43.74

#### Creation
* You can create random values within given bounds using the `np.random` package.
  * `RandomState` lets you seed your randomness in a way that is repeatable.
* You can append a row in a couple of ways:
   * You can use the `np.append` function.  Make sure the new row has the same number of dimensions as the existing array.
   * You can create/reassign a new array by including the existing array as part of the iterable in creation.
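
The sketch below illustrates both points under small, made-up sizes: two generators seeded with the same value produce the same numbers, and a row can be added either with `np.append` or by rebuilding the array. The names `log`, `appended`, and `rebuilt` are hypothetical.

```python
import numpy as np

# Two generators seeded identically produce identical values.
first = np.random.RandomState(42).randint(30, 180, size=5, dtype=np.uint16)
second = np.random.RandomState(42).randint(30, 180, size=5, dtype=np.uint16)
print(np.array_equal(first, second))   # True

# A small 2 x 5 log to extend.
log = np.zeros((2, 5), dtype=np.uint16)

# Option 1: np.append, with the new row wrapped in a list so that
# both inputs are 2-dimensional.
appended = np.append(log, [first], axis=0)

# Option 2: rebuild the array from its existing rows plus the new one.
rebuilt = np.array([*log, first])

print(appended.shape)   # (3, 5)
print(rebuilt.shape)    # (3, 5)
```

Either way the result is a new array; the original `log` keeps its 2 x 5 shape.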


#### Indexing
* You can use an indexing shortcut by separating dimensions with a comma.
* You can index using a `list` or `np.array`.  Values will be pulled out at those specific indices.  This is known as fancy indexing.
  * Resulting array shape matches the index array layout.  Be careful to distinguish between the tuple shortcut and fancy indexing.
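
Because the two forms look so similar, here is a minimal sketch that puts them side by side. The `grid` array and its values are made up for illustration.

```python
import numpy as np

grid = np.array([
    [10, 11, 12],
    [20, 21, 22],
])

# Tuple shortcut: one index per dimension, returns a single element.
print(grid[1, 0])        # 20

# Fancy indexing: a list selects whole rows 1 and 0, in that order.
print(grid[[1, 0]])
# [[20 21 22]
#  [10 11 12]]

# The result takes the shape of the index array.
index = np.array([[0, 2], [2, 0]])
print(grid[0][index])
# [[10 12]
#  [12 10]]
```

The only visual difference between the first two lookups is the extra pair of brackets, so it is worth reading index expressions carefully.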

### Boolean Array Indexing

In [49]:
fake_log

array([132, 122, 128,  44, 136, 129, 101,  95,  50, 132, 151,  64, 104,
       175, 117, 146, 139, 129, 133, 176,  98, 160, 179,  99,  82, 142,
        31, 106, 117,  56,  98,  67, 121, 159,  81, 170,  31,  50,  49,
        87, 179,  51, 116, 177, 118,  78, 171, 117,  88, 123, 102,  44,
        79,  31, 108,  80,  59, 137,  84,  93, 155, 160,  67,  80, 166,
       164,  70,  50, 102, 113,  47, 131, 161, 118,  82,  89,  81,  43,
        81,  38, 119,  52,  82,  31, 159,  57, 113,  71, 121, 140,  91,
        70,  37, 106,  64, 127, 110,  58,  93,  79], dtype=uint16)

In [48]:
fake_log < 60

array([False, False, False,  True, False, False, False, False,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False,  True,
       False, False,  True, False, False, False, False, False, False,
        True,  True,  True, False, False,  True, False, False, False,
       False, False, False, False, False, False,  True, False,  True,
       False, False,  True, False, False, False, False, False, False,
       False, False, False, False,  True, False, False,  True, False,
       False, False, False, False, False,  True, False,  True, False,
        True, False,  True, False,  True, False, False, False, False,
       False, False,  True, False, False, False, False,  True, False,
       False])

In [50]:
fake_log[fake_log < 60]

array([44, 50, 31, 56, 31, 50, 49, 51, 44, 31, 59, 50, 47, 43, 38, 52, 31,
       57, 37, 58], dtype=uint16)

In [51]:
results = []
for value in fake_log:
    if value < 60:
        results.append(value)
np.array(results)

array([44, 50, 31, 56, 31, 50, 49, 51, 44, 31, 59, 50, 47, 43, 38, 52, 31,
       57, 37, 58], dtype=uint16)

In [55]:
study_minutes < 60

array([[False, False, False, False,  True, False,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [False, False,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
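         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,
         True],
       [False, False, False,  True, False, False, False, False,  True,
        False, False, False, False, False, False, False, False, False,
        False, False, False, False, False, False, False, False,  True,
        False, False,  True, False, False, False, False, False, False,
         True,  True,  True, False, False,  True, False, False, False,
        False, False, False, False, False, False,  True, False,  True,
        False, False,  True, False, False, False, False, False, False,
        False, False, False, False,  True, False, False,  True, False,
        False, False, False, False, False,  True, False,  True, False,
         True, False,  True, False,  True, False, False, False, False,
        False, False,  True, False, False, False, False,  True, False,
        False]])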

In [57]:
# Returns a 1-dimensional array, not a 2-dimensional one with 3 rows
study_minutes[study_minutes < 60]

array([30,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0, 44, 50, 31, 56, 31, 50, 49, 51, 44, 31, 59,
       50, 47, 43, 38, 52, 31, 57, 37, 58], dtype=uint16)

In [58]:
np.array([False, True, True]) & np.array([True, False, True])

array([False, False,  True])

In [60]:
study_minutes[(study_minutes < 60) & (study_minutes > 0)]

array([30, 44, 50, 31, 56, 31, 50, 49, 51, 44, 31, 59, 50, 47, 43, 38, 52,
       31, 57, 37, 58], dtype=uint16)

In [67]:
# Set every value under 60 to 0:
study_minutes[study_minutes < 60] = 0
study_minutes[2]

array([132, 122, 128,   0, 136, 129, 101,  95,   0, 132, 151,  64, 104,
       175, 117, 146, 139, 129, 133, 176,  98, 160, 179,  99,  82, 142,
         0, 106, 117,   0,  98,  67, 121, 159,  81, 170,   0,   0,   0,
        87, 179,   0, 116, 177, 118,  78, 171, 117,  88, 123, 102,   0,
        79,   0, 108,  80,   0, 137,  84,  93, 155, 160,  67,  80, 166,
       164,  70,   0, 102, 113,   0, 131, 161, 118,  82,  89,  81,   0,
        81,   0, 119,   0,  82,   0, 159,   0, 113,  71, 121, 140,  91,
        70,   0, 106,  64, 127, 110,   0,  93,  79], dtype=uint16)

#### Code Challenge

In [69]:
cat_counts = np.array([1, 0, 2, 6, 5, 2, 1, 3, 18, 1, 2])

In [70]:
cat_counts[cat_counts > 2]

array([ 6,  5,  3, 18])

In [72]:
potential_adoptees = cat_counts[(cat_counts >= 1) & (cat_counts < 5)]

In [73]:
potential_adoptees

array([1, 2, 2, 1, 3, 1, 2])

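Remember: `&` combines boolean arrays element by element, while `and` tries to reduce each operand to a single truthy value, which fails for arrays with more than one element.  Wrap each comparison in parentheses, because `&` binds more tightly than `<` or `>=`.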
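
A short sketch of the difference, reusing `cat_counts` from the challenge above. The `raised` flag is a made-up helper that records whether `and` failed.

```python
import numpy as np

cat_counts = np.array([1, 0, 2, 6, 5, 2, 1, 3, 18, 1, 2])

# & combines the two boolean arrays element by element.
mask = (cat_counts >= 1) & (cat_counts < 5)
print(cat_counts[mask])   # [1 2 2 1 3 1 2]

# `and` asks Python for a single truth value for a whole array,
# which NumPy refuses to guess when there is more than one element.
try:
    (cat_counts >= 1) and (cat_counts < 5)
    raised = False
except ValueError:
    raised = True
print(raised)   # True
```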