# Lesson 04 - NumPy

### The following topics are discussed in this notebook:
* Create NumPy arrays.
* Array operations.
* Boolean masking. 

### Additional Resources
* [Python Data Science Handbook, Ch 2](https://jakevdp.github.io/PythonDataScienceHandbook/02.00-introduction-to-numpy.html)
* [DataCamp: Intro to Python for Data Science, Ch 4](https://www.datacamp.com/courses/intro-to-python-for-data-science)





### Packages 
A **package** is a pre-built set of functions and data types that can be loaded into a Python session to extend the language's functionality. 

The following block of code imports the `math` package, which contains many useful mathematical functions and constants. 

In [1]:
import math

The `math` package contains the following functions (along with many others):

* **`sqrt()`** which is used to calculate the square root of a number. 
* **`factorial()`** which is used to calculate the factorial of an integer.

It also contains an object named `pi`, which holds the value of the constant π.

To access any of these items within the `math` package, we must precede its name with `math.`. 

In [2]:
print(math.sqrt(20))
print(math.factorial(5))
print(math.pi)

4.47213595499958
120
3.141592653589793


When the name of a package is long, it can become tedious to type its entire name every time you wish to use a function from it. Fortunately, we are able to rename packages when we import them. The following code imports the `math` package under the name `mt`. 

In [3]:
import math as mt

In [4]:
print(mt.sqrt(40))

6.324555320336759


### NumPy

**NumPy**, which is short for "Numerical Python", is a package that provides additional functionality for performing numerical calculations on list-like collections of values. It can greatly simplify certain types of tasks that would otherwise require looping over a list. In the next cell, we will import NumPy under the name `np`. 

In [5]:
import numpy as np

At the core of NumPy is a new data type called an **array**. Arrays are similar to lists, and in many ways, arrays and lists behave the same. In the following cell, we create a list and an array, each containing the same elements. 

In [6]:
my_list = [4, 1, 7, 3, 5]
my_array = np.array([4, 1, 7, 3, 5])

In the next few cells, we show that lists and arrays can behave in very similar ways. 

In [7]:
print(my_list[3])
print(my_array[3])

3
3


In [8]:
print(my_list[:3])
print(my_array[:3])

[4, 1, 7]
[4 1 7]


In [9]:
print(len(my_list))
print(len(my_array))

5
5


In [10]:
print(type(my_list))
print(type(my_array))

<class 'list'>
<class 'numpy.ndarray'>


### Array Operations

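The key difference between arrays and lists is that arithmetic operations on an array are applied to each element, whereas the same operations on a list either mean something else or are not allowed. Assume that we would like to print out a list/array in which each element of our previously defined list/array is multiplied by 5. With an array this takes a single expression. With a list, `5 * my_list` repeats the list five times, multiplying by a float raises an error, and getting the intended result requires a loop. 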

In [11]:
print(5 * my_array)


[20  5 35 15 25]


In [12]:
print(5 * my_list)

[4, 1, 7, 3, 5, 4, 1, 7, 3, 5, 4, 1, 7, 3, 5, 4, 1, 7, 3, 5, 4, 1, 7, 3, 5]


In [13]:
print(5.2 * my_list)

TypeError: can't multiply sequence by non-int of type 'float'

In [14]:
temp = []
for i in range(0, len(my_list)):
    temp.append(5 * my_list[i])
print(temp)

[20, 5, 35, 15, 25]
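
As a compact comparison of the cells above, the following sketch scales the same values three ways: with a list comprehension (a shorter form of the loop), with array multiplication, and with a float multiplier on the array. The names `scaled_list`, `scaled_array`, and `scaled_float` are chosen here for illustration only.

```python
import numpy as np

my_list = [4, 1, 7, 3, 5]
my_array = np.array(my_list)

# A list comprehension does the same work as the explicit loop.
scaled_list = [5 * x for x in my_list]

# The array version needs no loop at all.
scaled_array = 5 * my_array

# Unlike a list, an array can also be multiplied by a float.
scaled_float = 5.2 * my_array

print(scaled_list)
print(scaled_array)
print(scaled_float)
```

The list and array results hold the same numbers; only the array form generalizes directly to float multipliers and other arithmetic operators.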


We can perform other types of operations on NumPy arrays:

In [15]:
print(my_array ** 2)

[16  1 49  9 25]


In [16]:
print(my_array +  100)

[104 101 107 103 105]


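NumPy also includes a meaningful way to multiply two arrays: the elements are multiplied position by position. This works as long as the arrays are of the same length; arrays of different lengths produce an error. 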

In [17]:
array1 = np.array([2,1,4])
array2 = np.array([3,9,2])

print(array1 * array2)

[6 9 8]


In [18]:
array1 = np.array([2,1,4])
array2 = np.array([3,9,2,7])

print(array1 * array2)

ValueError: operands could not be broadcast together with shapes (3,) (4,) 
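
The sketch below puts the two cases side by side under the assumption that NumPy is imported as `np`: equal-length arrays multiply position by position, while arrays of lengths 3 and 4 raise a `ValueError`. The names `array3`, `manual`, and `mismatch_error` are introduced here only for the example.

```python
import numpy as np

array1 = np.array([2, 1, 4])
array2 = np.array([3, 9, 2])
array3 = np.array([3, 9, 2, 7])

# Same length: the product is computed position by position.
product = array1 * array2

# The equivalent computation on plain Python lists needs zip and a loop.
manual = [a * b for a, b in zip(array1.tolist(), array2.tolist())]

print(product)
print(manual)

# Different lengths (3 vs. 4): NumPy refuses to multiply and raises ValueError.
mismatch_error = None
try:
    array1 * array3
except ValueError as err:
    mismatch_error = err
    print("ValueError:", err)
```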

In [19]:
y_actual = [3.1, 4.5, 6.4, 7.2]
y_pred = [2.9, 4.4, 6.7, 7.1]  # both are plain Python lists, not arrays

In [23]:
errors = []
sse = 0  # running sum of squared errors
for i in range(0, len(y_actual)):
    temp = y_actual[i] - y_pred[i]
    errors.append(temp)
    sse += temp**2
    print(errors)    
    print(sse)        

[0.20000000000000018]
0.04000000000000007
[0.20000000000000018, 0.09999999999999964]
0.05
[0.20000000000000018, 0.09999999999999964, -0.2999999999999998]
0.1399999999999999
[0.20000000000000018, 0.09999999999999964, -0.2999999999999998, 0.10000000000000053]
0.15000000000000002


In [24]:
y_actual = np.array(y_actual)
y_pred = np.array(y_pred)

errors = y_actual - y_pred
# sse = sum(errors**2) would also work
sse = np.sum(errors**2) 
print(errors)
print(sse)

[ 0.2  0.1 -0.3  0.1]
0.15
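
To check that the loop and the array computation agree, the calculation can be wrapped in a small helper. This is only a sketch: the function name `sum_squared_error` is hypothetical and not part of NumPy, and the comparison uses `np.isclose` because floating-point results may differ in the last digits.

```python
import numpy as np

def sum_squared_error(actual, predicted):
    """Return the sum of squared differences between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sum((actual - predicted) ** 2))

y_actual = [3.1, 4.5, 6.4, 7.2]
y_pred = [2.9, 4.4, 6.7, 7.1]

# Loop version, accumulating one squared error at a time.
loop_sse = 0.0
for a, p in zip(y_actual, y_pred):
    loop_sse += (a - p) ** 2

# Array version, with no explicit loop.
array_sse = sum_squared_error(y_actual, y_pred)

print(loop_sse)
print(array_sse)
print(np.isclose(loop_sse, array_sse))
```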


### Boolean Masking

**Boolean masking** is a tool for creating subsets of NumPy arrays. We will explain this concept in steps.

In the cell below, we create two NumPy arrays. The array `bool_array` contains boolean values, while the other, `my_array`, contains numerical values. 

We will pass `bool_array` to `my_array` as if it were an index, and will store the result in `sub_array`. Only the elements of `my_array` at positions where `bool_array` is `True` are kept. 

In [25]:
bool_array = np.array([True, True, False, True, False])
my_array = np.array([1,2,3,4,5])

sub_array = my_array[bool_array]
print(sub_array)

[1 2 4]


Unlike lists, arrays support numerical comparisons against a single value. The comparison is carried out for each element of the array, and the result is an array of boolean values, containing the results of each comparison. 

In [24]:
some_array = np.array([4, 7, 6, 3, 9, 8])
print(some_array < 5)

[ True False False  True False False]


In [25]:
print(some_array % 2 == 0)

[ True False  True False False  True]


We can combine array comparisons with boolean indexing to create subsets of arrays by picking out the elements that satisfy certain conditions. This process is called **boolean masking**. 

In [26]:
sel = some_array % 2 == 0
print(some_array[sel])

[4 6 8]


In [27]:
sel = some_array > 5
print(some_array[sel])

[7 6 9 8]


In [28]:
print(some_array[some_array > 5])

[7 6 9 8]
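
Masks can also be combined before they are used for selection. The following sketch uses NumPy's elementwise operators `&` (and), `|` (or), and `~` (not) on the same `some_array`; each comparison is wrapped in parentheses because these operators bind more tightly than `<`, `>`, and `==`. The variable names are illustrative.

```python
import numpy as np

some_array = np.array([4, 7, 6, 3, 9, 8])

# Elements that are even AND greater than 5.
even_and_large = some_array[(some_array % 2 == 0) & (some_array > 5)]

# Elements that are smaller than 4 OR larger than 8.
extremes = some_array[(some_array < 4) | (some_array > 8)]

# ~ negates a mask: everything that is NOT greater than 5.
not_large = some_array[~(some_array > 5)]

print(even_and_large)  # [6 8]
print(extremes)        # [3 9]
print(not_large)       # [4 3]
```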


### Using Boolean Masks to Count

Since Python treats `True` as being equal to 1 and `False` as being equal to 0, we can use the `sum` function along with Boolean masking to count the number of elements in an array that satisfy a certain criterion. 

In [29]:
cat = np.array(['A', 'C', 'A', 'B', 'B', 'C', 'A', 'A' ,'C', 'B', 'C', 'C', 'A', 'B', 'A', 'A'])

In [35]:
print(np.sum(cat == 'A'))
print(sum(cat == 'B'))
print(sum(cat == 'C'))

7
4
5


In [36]:
val = np.array([8, 1, 3, 6, 10, 6, 12, 4, 6, 1, 4, 8, 5, 4, 12, 4])

In [37]:
print(np.sum(val > 5) )
print(np.sum(val < 5) )
print(np.sum(val % 2 == 0) )
print(sum(val % 2 != 0) )

8
7
12
4


In [38]:
sum( (val > 5) & (val % 2 == 0) )

8

In [42]:
sum( (val > 7) & (val % 2 == 0) & (val % 3 == 0))

2
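
Besides `sum`, NumPy offers other ways to summarize a mask. The sketch below counts the matching elements with `np.count_nonzero` and uses `np.mean` to obtain the proportion of elements that satisfy the condition. The names `mask`, `count_sum`, `count_nz`, and `proportion` are chosen for this example.

```python
import numpy as np

val = np.array([8, 1, 3, 6, 10, 6, 12, 4, 6, 1, 4, 8, 5, 4, 12, 4])

# Even values greater than 5.
mask = (val > 5) & (val % 2 == 0)

count_sum = np.sum(mask)           # True counts as 1, False as 0
count_nz = np.count_nonzero(mask)  # same count, different function
proportion = np.mean(mask)         # fraction of elements that are True

print(count_sum)
print(count_nz)
print(proportion)
```

Here 8 of the 16 values satisfy the condition, so the proportion is 0.5.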

### Random Number Generation

We can use NumPy to draw random samples from a set. 

In [46]:
sample1 = np.random.choice(['A', 'B', 'C', 'D', 'E'], 10)
print(sample1) 

['B' 'A' 'C' 'C' 'A' 'B' 'D' 'A' 'E' 'B']


In [49]:
sample2 = np.random.choice(['A', 'B', 'C', 'D', 'E'], 3, replace=False)
print(sample2)

['A' 'E' 'C']


In [48]:
sample3 = np.random.choice(['A', 'B', 'C', 'D', 'E'], 1)
print(sample3)  # with a size given, the result is an array containing one element

['D']


In [50]:
sample3 = np.random.choice(['A', 'B', 'C', 'D', 'E'])
print(sample3)  # with no size given, the result is a single element, not an array

D
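
The difference between the last two cells, and the effect of `replace=False`, can be checked directly. This is a small sketch; the variable names are illustrative, and the sampled letters will differ from run to run.

```python
import numpy as np

options = ['A', 'B', 'C', 'D', 'E']

with_size = np.random.choice(options, 1)   # size given: array of length 1
without_size = np.random.choice(options)   # no size: a single element

print(type(with_size), with_size.shape)
print(type(without_size))
print(isinstance(without_size, np.ndarray))

# Without replacement, no element can be drawn twice.
no_repeat = np.random.choice(options, 5, replace=False)
print(sorted(no_repeat.tolist()))
```

Drawing all five letters without replacement returns each letter exactly once, in random order.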


We can generate random numbers according to a distribution, such as the normal or uniform distribution. 

In [52]:
x1 = np.random.uniform(5, 15, 20)  # 20 values drawn uniformly between 5 and 15
print(x1)

[ 14.05434383   8.90285961  10.77460891   7.55315462  14.2620902
   8.26229843   6.8879158    5.52830994   7.70331564   6.05979708
  12.46813057  12.2056303    8.31676938  10.92545616   5.95622732
   6.41381503  13.06856928  12.11865923   7.092778    10.06650911]


In [66]:
x2 = np.random.normal(10, .1, 20)  # 20 values with mean 10 and standard deviation 0.1
print(x2)

[ 10.00961716  10.18768154  10.05337806  10.079836     9.96939255
  10.03779739   9.9773522    9.81142555  10.2472511    9.98531434
  10.01156492  10.13304306  10.01095947  10.02154791  10.04290701
   9.88487344  10.15247535  10.03605358   9.9389107   10.0930246 ]
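
One way to see what the arguments mean is to draw a much larger sample and compare its summary statistics with the requested parameters. The sketch below does this with 100,000 draws (a sample size chosen here for illustration); the printed values vary slightly between runs, so no exact output is shown.

```python
import numpy as np

n_draws = 100_000  # large sample so the statistics settle near the parameters

u = np.random.uniform(5, 15, n_draws)   # low=5, high=15
n = np.random.normal(10, 0.1, n_draws)  # mean=10, standard deviation=0.1

# Uniform draws stay inside the requested interval and average near its midpoint.
print(u.min(), u.max(), u.mean())

# Normal draws are centered on the mean with the requested spread.
print(n.mean(), n.std())
```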


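We can set the seed of the random number generator using `np.random.seed()`. Setting the same seed before drawing produces the same values each time, while a different seed produces different values.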

In [58]:
np.random.seed(32)
x3 = np.random.normal(10,5,20)
print(x3)

[  8.25552775  14.91851717  12.90461415  10.3514222   13.88766338
  12.90979373  17.35895263  18.31590504   8.69411439   6.55661594
   6.52538369  19.7021173   19.02707595  12.28156926   7.1259398
  10.57090252  17.56790385  11.7556592    9.55533393  14.58477114]


In [62]:
np.random.seed(8)
x3 = np.random.normal(10,5,20)
print(x3)

[ 10.45602358  15.45641367   0.26514845   3.06825234  -1.48245787
  22.04917152  18.63918084  21.02278142  13.9741382   14.88210548
   4.08286426  19.58181805   4.383366     6.67982265   8.10820715
   6.04192364  14.29774055   8.84605502   9.67169486   8.95681883]


In [63]:
np.random.seed(8)
x3 = np.random.normal(10,5,20)
print(x3)

[ 10.45602358  15.45641367   0.26514845   3.06825234  -1.48245787
  22.04917152  18.63918084  21.02278142  13.9741382   14.88210548
   4.08286426  19.58181805   4.383366     6.67982265   8.10820715
   6.04192364  14.29774055   8.84605502   9.67169486   8.95681883]
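
The repeated output above can be confirmed programmatically. In this sketch, resetting the seed to the same value reproduces the draw exactly, while drawing again without resetting continues the random stream and gives different values. The names `first`, `second`, and `third` are illustrative.

```python
import numpy as np

np.random.seed(8)
first = np.random.normal(10, 5, 20)

np.random.seed(8)                     # reset to the same seed
second = np.random.normal(10, 5, 20)

third = np.random.normal(10, 5, 20)   # no reset: the stream continues

print(np.array_equal(first, second))  # True
print(np.array_equal(second, third))  # False
```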
