# NumPy

Lists are a great way to store information, and they are mutable, but they are not ideal for performing calculations over a large number of values quickly, especially if the values are stored in separate objects. This can be solved with NumPy (Numerical Python) and the **NumPy array**. It behaves similarly to a list, but you can do calculations over an entire array quickly and efficiently. Install it with `pip3 install numpy`.

After installing and importing NumPy, you can convert a list to an array with the `array` function, much like changing its type. You can then do **element-wise calculations**, meaning that the first element of the first array is combined with the first element of the second array, the second with the second, and so on.

In [43]:
# Lists with information
height = [1.65, 1.78, 1.58, 1.80, 1.57]
weight = [56, 67, 63, 93, 57]

# Try to calculate body mass index with lists, it will fail!
# bmi_list = weight / height ** 2


## Using NumPy

# Convert to array
import numpy as np
np_height = np.array(height)
np_weight = np.array(weight)

# Calculate body mass index (BMI) on the whole array at once
bmi = np_weight / np_height ** 2
print(bmi)


[ 20.56932966  21.14631991  25.23634033  28.7037037   23.12467037]
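To see why the list version fails, here is a minimal sketch that catches the error and then does the same calculation by hand with a list comprehension. The variable names match the cell above; the `try`/`except` wrapper and the comprehension are illustrative additions, not part of the original notebook.

```python
# Same data as above, kept as plain Python lists
height = [1.65, 1.78, 1.58, 1.80, 1.57]
weight = [56, 67, 63, 93, 57]

# Lists do not define ** or / against numbers, so this raises a TypeError
try:
    bmi_list = weight / height ** 2
except TypeError as err:
    print("Lists do not support this:", err)

# Without NumPy, the calculation has to be written out element by element
bmi_list = [w / h ** 2 for w, h in zip(weight, height)]
print(bmi_list)
```

The comprehension gives the same values as the array version, but the loop is explicit and runs in plain Python, which is what becomes slow for large datasets.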


## Remarks

1) NumPy can do these calculations so easily because it assumes that the arrays contain values of **_the same type!_** If there are multiple types, the `array` function will try to convert every value to a single type, known as **type coercion**, so pay attention to the type of your inputs and the type of the final array. By the way, the Boolean values `True` and `False` are converted to `1` and `0`, respectively, when coerced to `int`.

2) An array is another type in Python, which means that it comes with its own set of methods and operators. These behave differently than for lists, e.g., if you add two lists with `+` you end up with a new list containing all the values of both, whereas if you add two arrays, NumPy adds them element by element!

3) Values in an array can be extracted with subsetting just like in a list, i.e., with indices. However, arrays can also be subsetted using an array of Boolean values. This can give you all the values that are, for example, greater than some threshold; see below.

In [44]:
# Lists added 
added_list = height + weight
print(added_list)

# Arrays added
added_array = np_height + np_weight
print(added_array)

# Subsetting arrays
# With an index
print(bmi[2])

# With an array of boolean
print(bmi)
print(bmi > 23)
print(bmi[bmi > 23])




[1.65, 1.78, 1.58, 1.8, 1.57, 56, 67, 63, 93, 57]
[ 57.65  68.78  64.58  94.8   58.57]
25.2363403301
[ 20.56932966  21.14631991  25.23634033  28.7037037   23.12467037]
[False False  True  True  True]
[ 25.23634033  28.7037037   23.12467037]
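The type coercion described in remark 1 can be sketched as follows. The arrays here (`mixed_numbers`, `mixed_text`, `flags`) are made-up examples, not data from the notebook.

```python
import numpy as np

# Integers, floats and Booleans together are coerced to float
mixed_numbers = np.array([1, 2.5, True])
print(mixed_numbers, mixed_numbers.dtype)

# As soon as a string is present, every value becomes a string
mixed_text = np.array([1.0, "two", True])
print(mixed_text, mixed_text.dtype)

# Booleans coerced to int become 1 and 0
flags = np.array([True, False, True])
print(flags.astype(int))
```

Checking `dtype` after building an array is a quick way to notice that a stray string has turned numeric data into text.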


# 2D NumPy Arrays

Arrays created with NumPy are of type `numpy.ndarray`, which means that they are N-dimensional arrays. You can create 1-, 2-, and N-dimensional arrays. For example, a two-dimensional array can be created from a list of two lists of equal length, which gives a 2D array where each row corresponds to one of the lists.

You can get information about the 2D array with its `shape` _attribute_ (e.g., `np_2d.shape`) or with the equivalent function `numpy.shape()`; both give you the array's dimensions as a tuple of (rows, columns). N-dimensional arrays can only hold a single type, and multiple types will be coerced into one.


In [45]:
np_2d = np.array([[1, 4, 56, 6, 7, 8], [7, 4, 3, 2, 7, 32]])
print(np_2d)

np.shape(np_2d)

[[ 1  4 56  6  7  8]
 [ 7  4  3  2  7 32]]


(2, 6)
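As a small sketch of the ways to inspect an array, the snippet below rebuilds the same `np_2d` and reads its `shape`, `ndim`, and `dtype` attributes. The exact integer `dtype` printed depends on the platform, so treat that line as informational.

```python
import numpy as np

np_2d = np.array([[1, 4, 56, 6, 7, 8], [7, 4, 3, 2, 7, 32]])

# shape is an attribute (no parentheses); np.shape() is the function form
print(np_2d.shape)
print(np.shape(np_2d))

# Number of dimensions and the single element type of the array
print(np_2d.ndim)
print(np_2d.dtype)
```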

## Subsetting

To subset an N-dimensional array, use the same rules as for lists and 1D arrays, i.e., square brackets and indices. You can select a row with square brackets and then the element in that row using square brackets again. However, a more efficient way of doing it is to use a single pair of square brackets and a comma, which avoids creating the intermediate row. **The value before the comma specifies the row and the value after the comma specifies the column**. This is similar to subsetting a `data.frame` in R. The same works with slices.

You can use these intersections of rows and columns to do element-wise calculations.

In [46]:
print(np_2d)

# Using two subsets
print(np_2d[0][2])

# Using a single subset separated by a comma
print(np_2d[0,2])

# Slicing: all rows, second and third columns
print(np_2d[:, 1:3])

# All columns in the second row
print(np_2d[1, :])


[[ 1  4 56  6  7  8]
 [ 7  4  3  2  7 32]]
56
56
[[ 4 56]
 [ 4  3]]
[ 7  4  3  2  7 32]
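Element-wise calculations on such row and column selections could look like the sketch below. The names `row_sum` and `doubled` are chosen here for illustration.

```python
import numpy as np

np_2d = np.array([[1, 4, 56, 6, 7, 8], [7, 4, 3, 2, 7, 32]])

# Add the first row to the second row, element by element
row_sum = np_2d[0, :] + np_2d[1, :]
print(row_sum)

# Double the second and third columns of every row
doubled = np_2d[:, 1:3] * 2
print(doubled)
```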


# NumPy: Basic Statistics
The first step in analyzing a large dataset is to generate summary statistics for it, and NumPy can do this very efficiently. There are many built-in functions, such as `numpy.mean`, `numpy.median`, `numpy.corrcoef` (correlation), and `numpy.std` (standard deviation). Functions like `numpy.sum` and `numpy.sort` are also available, and on arrays they are typically much faster than Python's built-in `sum` and `sorted` on lists.  
Finally, you can generate data with `numpy.random.normal`, round the numbers with `numpy.round`, and combine several 1D arrays as the columns of a single 2D array with `numpy.column_stack`.


In [49]:
import numpy as np

height = np.round(np.random.normal(1.75, 0.20, 5000), 2)
weight = np.round(np.random.normal(65.35, 15, 5000), 2)
np_city = np.column_stack((height, weight))
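The summary functions mentioned above can then be applied to the columns of `np_city`. This is a sketch: the call to `np.random.seed` is an addition made here so that the run is repeatable, and the result names (`mean_height`, etc.) are illustrative.

```python
import numpy as np

# Assumption: fix the seed so repeated runs give the same data
np.random.seed(0)

height = np.round(np.random.normal(1.75, 0.20, 5000), 2)
weight = np.round(np.random.normal(65.35, 15, 5000), 2)
np_city = np.column_stack((height, weight))

# Column 0 holds the heights, column 1 the weights
mean_height = np.mean(np_city[:, 0])
median_height = np.median(np_city[:, 0])
std_weight = np.std(np_city[:, 1])

# 2x2 correlation matrix between height and weight
corr = np.corrcoef(np_city[:, 0], np_city[:, 1])

print(np_city.shape)
print(mean_height, median_height, std_weight)
print(corr)
```

Because height and weight are generated independently here, the off-diagonal entries of the correlation matrix should come out close to zero.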


## Notes
It's always a good idea to compare the mean and median of the data: together they give a quick overall picture of the distribution, and a large gap between the two suggests outliers or a skewed distribution.
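A minimal illustration of that note, using a made-up array of heights in which one value was entered incorrectly (17.5 instead of 1.75):

```python
import numpy as np

# Hypothetical data: the last height is a data-entry error
heights = np.array([1.60, 1.70, 1.75, 1.80, 17.5])

mean_h = np.mean(heights)
median_h = np.median(heights)

# The mean is pulled far above the typical value; the median is not
print(mean_h)
print(median_h)
```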