# Introduction to `numpy`

# Contents:

Dataset used: california_housing_test.csv

1. Setup
2. `numpy`
  1. Intro and arrays
  2. Loading `numpy`
  3. Creating arrays
  4. Reshaping
  5. Basic operations
  6. Logic and filtering
  7. Loading data

# `numpy`

## What is `numpy`?

`numpy` is a Python package for scientific computing. It gives us _homogeneous multidimensional arrays_ -- a way of representing grids of elements where all elements are the same data type -- and functions to work efficiently with them. `numpy` both underpins and complements other data science libraries, including `pandas`, `matplotlib`, and `scikit-learn`.

## Arrays vs lists

| `numpy` arrays | Python `list` |
|---|---|
| all elements must be the same type | elements can be different types |
| fixed size | can change size |
| n-dimensional | 1-dimensional |
| faster to process | slower to process |
| consumes less memory | consumes more memory |
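
As a small sketch of the first row of the table, the example below builds a list and an array from the same values. The names `mixed_list`, `list_types`, and `mixed_arr` are made up for illustration.

```python
import numpy as np

# a list keeps each element's own type
mixed_list = [1, 2.5, 3]
list_types = [type(x).__name__ for x in mixed_list]

# an array converts every element to one common dtype
mixed_arr = np.array(mixed_list)

print(list_types)       # the ints stay ints, the float stays a float
print(mixed_arr.dtype)  # a single floating-point dtype for all elements
print(mixed_arr)
```

Because one of the values is a float, `numpy` stores all three as floats so that the array stays homogeneous.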

## Loading `numpy`

We can `import` `numpy` like any other module. By convention, `numpy` is loaded `as np` -- an alias that makes referencing it easier -- and we will use that alias throughout our Python code.

In [None]:
import numpy as np

## `numpy` arrays

The main object in `numpy` is the `ndarray`, also referred to as the `array`. Dimensions in an array are called _axes_. Most arrays we'll work with will be one-dimensional _vectors_ and two-dimensional _matrices_.

We can create an array by calling `np.array()` and passing in the data as a single argument, such as a list (or a list of lists for a matrix). Below is a matrix. The first axis has a length of 2, and the second axis has a length of 3.

In [None]:
a = np.array([[1, 2, 3],
              [3, 2, 1]])
a

An `array` has an `ndim` attribute indicating the number of its axes, a `size` indicating the number of values it has, and a `shape` indicating its size in each dimension. It also has a `dtype` describing what data type all of the elements in the array are.

In [None]:
# number of dimensions
print(a.ndim)

# notice that the shape is rows x columns
print(a.shape)

# notice that the size is rows * columns
print(a.size)

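# the default integer dtype is provided by numpy and depends on the platform:
# typically int64, but int32 on Windows with older numpy versions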
print(a.dtype)
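
To see how these attributes change with the number of axes, here is a minimal sketch comparing a vector with a three-dimensional array. The names `vec` and `cube` are invented for this example.

```python
import numpy as np

# a 1D vector of floats
vec = np.array([1.5, 2.5, 3.5])

# a 3D array: 2 blocks, each with 2 rows and 2 columns
cube = np.array([[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]]])

for name, arr in [("vec", vec), ("cube", cube)]:
    print(f"{name}: ndim={arr.ndim}, shape={arr.shape}, "
          f"size={arr.size}, dtype={arr.dtype}")
```

Note that `shape` is always a tuple, even for a vector, and that `size` is the product of the entries in `shape`.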

We can create arrays with placeholder content in several ways. This is useful when we know how many elements will be in an array, but not their values, as `numpy` arrays have fixed size. The full list is in [`numpy`'s documentation](https://numpy.org/devdocs/user/basics.creation.html#arrays-creation).

In [None]:
# create a 2x3x2 array of zeros. notice the double parentheses:
# the shape is passed as a single tuple
np.zeros((2, 3, 2))

In [None]:
# create an array of ones based on the earlier a array
np.ones_like(a)
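
A few other creation functions follow the same pattern. The sketch below uses `np.full()`, `np.zeros_like()`, and `np.eye()`; the variable names are illustrative only, and `a` is redefined so the example runs on its own.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [3, 2, 1]])

# a 2x3 array where every element is 7
sevens = np.full((2, 3), 7)

# zeros with the same shape and dtype as a
zeros_like_a = np.zeros_like(a)

# a 3x3 identity matrix (ones on the diagonal, zeros elsewhere)
identity = np.eye(3)

print(sevens)
print(zeros_like_a)
print(identity)
```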

We can also create arrays by specifying a range of values through `arange()` or generating random ones through functions like `random.randint()` and `random.random()`.

In [None]:
# create a 1D array from 1 up to (but not including) 10 in steps of 2
np.arange(1, 10, 2)

In [None]:
# create a 1D array from 0 up to (but not including) 1 in steps of 0.1
np.arange(0, 1, 0.1)
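
Because `arange()` excludes the stop value, `np.linspace()` can be a handier choice when we know how many evenly spaced points we want and need both endpoints. A short sketch, with `steps` and `points` as made-up names:

```python
import numpy as np

# the stop value 1 is excluded
steps = np.arange(0, 1, 0.1)

# 11 evenly spaced values, with both 0 and 1 included
points = np.linspace(0, 1, 11)

print(len(steps), steps[-1])
print(len(points), points[-1])
```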

In [None]:
# seed is used to create a reproducible random example
np.random.seed(1)
# create a 3x4 array of random integers from 1 to 9
# (the upper bound, 10, is exclusive)
np.random.randint(1, 10, (3, 4))

We can repeat values and arrays to create bigger ones with `repeat()` and `tile()`.

`repeat()` repeats each element of an array a specified number of times.
`tile()` repeats the whole input array a specified number of times.

In [None]:
# create a 1D array through repetition
np.repeat(10, 5)

In [None]:
onedim_arr = np.array([1, 2, 3, 4, 5])
# repeat each element 5 times; the result is still 1D
repeated_arr = np.repeat(onedim_arr, 5)
repeated_arr

In [None]:
# create a 2D array through repetition
multidim_arr = np.tile(onedim_arr, (5,1))
multidim_arr

In [None]:
# repeat onedim_arr in multiple directions
another_arr = np.tile(onedim_arr, (3,3))
another_arr
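
The difference between the two functions is easiest to see on a small 2D array. In this sketch, `grid` is an invented example array, and the `axis` argument of `repeat()` chooses the direction in which elements are repeated.

```python
import numpy as np

grid = np.array([[1, 2],
                 [3, 4]])

# repeat each row twice, one after the other
rows_repeated = np.repeat(grid, 2, axis=0)

# repeat each column twice, one after the other
cols_repeated = np.repeat(grid, 2, axis=1)

# repeat the whole array twice, stacked vertically
tiled = np.tile(grid, (2, 1))

print(rows_repeated)
print(cols_repeated)
print(tiled)
```

`rows_repeated` and `tiled` have the same shape but a different row order: `repeat()` keeps copies of an element next to each other, while `tile()` repeats the array as a block.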

### Simulating data

Sometimes, it can be useful to simulate data to make sure an analysis works as expected. `numpy`'s `random` submodule provides support for drawing random samples from a variety of distributions. To generate random samples, we first call `default_rng()` to create a sample Generator. Then, we use the Generator's various distribution methods, like `normal()`, to create sample arrays.

In [None]:
# create the sample Generator
rng = np.random.default_rng(seed=42)

# create a 5x4 array w/ normally distributed data
# distribution has mean 10, standard dev. 2.5
randrng = rng.normal(10, 2.5, (5, 4))
randrng

In [None]:
# using the same Generator to draw a random sample from a Poisson distribution
# lam is 1 and size is 5x6
poissonrng = rng.poisson(1, (5, 6))
poissonrng
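
One way to check that simulated data behaves as expected is to draw a large sample and compare its summary statistics with the parameters we asked for. The sketch below assumes a seed of 0 and a sample size of 100,000, both chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# a large sample should have a mean near 10 and a standard deviation near 2.5
sample = rng.normal(10, 2.5, 100_000)
print(round(sample.mean(), 2), round(sample.std(), 2))

# two Generators created with the same seed produce the same draws
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)
same_draws = np.array_equal(rng_a.poisson(1, 5), rng_b.poisson(1, 5))
print(same_draws)
```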

## Reshaping arrays

We can change the shape of an array in various ways, leaving the elements the same.

In [None]:
# get the transpose of an array
b = np.array([[5, 4, 3, 2, 1, 0],
              [10, 8, 6, 4, 2, 0]])
b.T

In [None]:
# change dimensions
b.reshape(4, 3)

In [None]:
# collapse an array to one dimension
b.flatten()
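
When reshaping, we can pass `-1` for one dimension and let `numpy` work out its length from the total number of elements. The new shape must still account for every element, as this sketch shows; `three_cols`, `flat`, and `reshape_failed` are illustrative names.

```python
import numpy as np

b = np.array([[5, 4, 3, 2, 1, 0],
              [10, 8, 6, 4, 2, 0]])

# 12 elements in 3 columns: numpy infers 4 rows
three_cols = b.reshape(-1, 3)

# a single -1 collapses the array to one dimension
flat = b.reshape(-1)

# 12 elements cannot fill a 5x3 array, so this raises a ValueError
try:
    b.reshape(5, 3)
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print(err)

print(three_cols.shape, flat.shape)
```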

If two arrays share the same size along an axis, we can stack them with `np.hstack()` and `np.vstack()`. Notice that we pass the arrays to stack as a tuple.

In [None]:
# stack a and b horizontally
np.hstack((b, a))

In [None]:
# reshape b before stacking vertically
np.vstack((b.reshape(4, 3), a))
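
For two-dimensional arrays like these, the same results can be obtained with `np.concatenate()` by naming the axis to join along. The following is a sketch using the `a` and `b` arrays from above, redefined here so it runs on its own.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [3, 2, 1]])
b = np.array([[5, 4, 3, 2, 1, 0],
              [10, 8, 6, 4, 2, 0]])

# join along columns: both arrays must have the same number of rows
side_by_side = np.concatenate((b, a), axis=1)

# join along rows: both arrays must have the same number of columns
stacked = np.concatenate((b.reshape(4, 3), a), axis=0)

print(side_by_side.shape)
print(stacked.shape)
```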

## Basic operations

`numpy` arrays let us perform vector operations, manipulating all the elements in an axis without writing loops.

In [None]:
arr1 = np.array([5, 10, 15, 20])
arr1

In [None]:
arr2 = np.arange(5, 9)
arr2

In [None]:
np.sin(np.arange(0, 360, 45) * np.pi/180)

We can perform operations when arrays are the same length along the axis in use, or when values can be _broadcast_, or repeated, along an axis.

In [None]:
# arr1 and arr2 are the same length
arr1 - arr2

In [None]:
results = []
for i, j in zip(arr1, arr2):
    results.append(i - j)

results

In [None]:
# multiply each element in arr1 by 2
arr1 * 2

In [None]:
# incompatible shapes: this raises a ValueError
arr2 + np.array([1, 2])

In [None]:
short_output = []
for i, j in zip(arr2, np.array([1, 2])):
    short_output.append(i + j)

short_output

We can also summarize the values in an array.

In [None]:
print(f'''arr1 sums to {arr1.sum()}.
Its max value is {arr1.max()}, and its mean is {arr1.mean()}.''')

Some descriptive statistics are only `numpy` functions, and are not available as `array` methods like we saw above.

In [None]:
np.median(arr1)

In [None]:
# raises an AttributeError: arrays have no median() method
arr1.median()
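
A sketch of how we might check for a method before calling it, and of another statistic that is only available as a function, `np.percentile()`. The variable names are invented for illustration.

```python
import numpy as np

arr1 = np.array([5, 10, 15, 20])

# arrays have sum(), max(), and mean() methods, but no median()
has_median_method = hasattr(arr1, "median")

# the function form works
median = np.median(arr1)

# percentile() is also function-only; the 50th percentile is the median
quartiles = np.percentile(arr1, [25, 50, 75])

print(has_median_method)
print(median)
print(quartiles)
```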

## Operations in multiple dimensions

In a multi-dimensional array like a matrix, elements of same-sized arrays will be paired up for operations, just as with vectors. If we perform an operation with a matrix and a vector of the same size in one dimension, the vector can be _broadcast_ to repeat along the other dimension.

In [None]:
tens = np.arange(0, 120, 10).reshape(3, 4)
tens

In [None]:
horizontal = np.array([-5, -10, -15, -20])

tens + horizontal

In [None]:
vertical = np.array([[100],
                     [200],
                     [300]])
tens + vertical
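
Broadcasting can also stretch both operands at once. In this sketch, a column of shape `(3, 1)` and a row of shape `(4,)` are each repeated along their missing dimension, giving a `(3, 4)` result; `col`, `row`, and `table` are made-up names.

```python
import numpy as np

col = np.array([[0],
                [10],
                [20]])          # shape (3, 1)
row = np.array([1, 2, 3, 4])    # shape (4,)

# every element of col is paired with every element of row
table = col + row

print(table.shape)
print(table)
```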

We can still calculate statistics for multidimensional arrays, but we must specify the axis to calculate over. To calculate values for each column, we use `axis=0`. To calculate for each row, we use `axis=1`.

#### 1. `axis=0`: Column-wise Operation

When you set `axis=0`, the operation is performed along the vertical axis of the array, i.e., down the rows. This is often referred to as a column-wise operation.

- **In a 2D array**: This calculates the mean of each column.
- **Example**: If you have a 2x3 array (2 rows, 3 columns), using `axis=0` with `.mean()` will result in a 1D array with 3 elements, each being the mean of the columns.

#### 2. `axis=1`: Row-wise Operation

Conversely, when you set `axis=1`, the operation is performed along the horizontal axis of the array, i.e., across the columns. This is referred to as a row-wise operation.

- **In a 2D array**: This calculates the mean of each row.
- **Example**: If you have a 2x3 array, using `axis=1` with `.mean()` will yield a 1D array with 2 elements, each being the mean of the rows.

#### Summary

- `axis=0`: Column-wise operation (down the rows).
- `axis=1`: Row-wise operation (across the columns).

In [None]:
tens.mean(axis=0)

In [None]:
tens.mean(axis=1)
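
The 2x3 example described above can be sketched as follows. The array `small` and its values are invented for illustration, and `keepdims=True` is an optional extra that keeps the result two-dimensional.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

# axis=0: one mean per column, so 3 values
col_means = small.mean(axis=0)

# axis=1: one mean per row, so 2 values
row_means = small.mean(axis=1)

# keepdims=True returns a 2x1 column instead of a 1D array
row_means_2d = small.mean(axis=1, keepdims=True)

print(col_means)
print(row_means)
print(row_means_2d.shape)
```

Keeping the dimensions can be convenient when the result is used in a later broadcast operation, such as subtracting each row's mean from that row with `small - row_means_2d`.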

## Indexing, slicing, and iterating

We can index and slice arrays like we would a list.

In [None]:
arr1

In [None]:
arr1[1]

In [None]:
arr1[1:3]

We can iterate over arrays as well, though vectorized `numpy` operations are preferred when possible.

In [None]:
for i in arr1:
    print(i)

Multidimensional arrays like matrices have one index per axis. We can pass in more than one index within the square brackets.

In [None]:
tens

In [None]:
# indexing goes row, column
tens[1, 2]

In [None]:
# get the first row
tens[0]

In [None]:
# get the first column
tens[:,0]

In [None]:
# slice the first two rows (indices 0-1) and the second and third columns (indices 1-2)
tens[0:2, 1:3]
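
Negative indices and slice steps work on each axis too, just as they do for lists. A brief sketch, with `tens` rebuilt so the example is self-contained:

```python
import numpy as np

tens = np.arange(0, 120, 10).reshape(3, 4)

# the last row
last_row = tens[-1]

# the last column
last_col = tens[:, -1]

# every second row and every second column
every_other = tens[::2, ::2]

print(last_row)
print(last_col)
print(every_other)
```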

## Mutations and copies

We can also update individual elements in an array. Note that any variables that refer to that array will also change, just like with mutating lists. To make an independent copy of an array, use the `.copy()` method.

In [None]:
# create a 3x4 array of random integers
matrix = np.random.randint(1, 11, 12).reshape(3, 4)
matrix2 = matrix

# make a copy
matrix3 = matrix.copy()

matrix

In [None]:
# replace the second row
matrix2[1] = [0, 0, 0, 0]
matrix2

In [None]:
# the original also changed
matrix

In [None]:
# the copy did not
matrix3
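
Slices behave like the second variable name above: a basic slice is a _view_ onto the original data rather than a copy. The sketch below uses `np.shares_memory()` to confirm this; `original`, `view`, and `independent` are illustrative names.

```python
import numpy as np

original = np.arange(6)

# a basic slice is a view onto the same data
view = original[:3]

# .copy() gives the slice its own data
independent = original[:3].copy()

# changing the view also changes the original, but not the copy
view[0] = 99

print(original)
print(independent)
print(np.shares_memory(original, view))
print(np.shares_memory(original, independent))
```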

## Logic and filtering

`numpy` arrays work with boolean expressions. Each element is checked, and the result is an array of `True`/`False` values. The resulting arrays are sometimes called _masks_ because they are used to mask, or filter, data.

In [None]:
tens

In [None]:
# evaluate whether each element is divisible by 3
tens % 3 == 0

In [None]:
# the same thing with for loops
masked = []

for row in tens:
    masked_row = []
    for col in row:
        masked_row.append(col % 3 == 0)
    masked.append(masked_row)

masked

To use a boolean expression as a mask, pass it into square brackets after the array to filter. The result is a one-dimensional array containing only the elements where the mask is `True`.

In [None]:
tens[tens % 3 == 0]

In [None]:
# also works
mask = tens % 3 == 0
tens[mask]

In [None]:
# comparison in standard python
filtered_data = []

for row in tens:
    for col in row:
        if col % 3 == 0:
            filtered_data.append(col)

filtered_data
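
Masks can be combined with the element-wise operators `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch using the same `tens` array, rebuilt here so the example stands alone:

```python
import numpy as np

tens = np.arange(0, 120, 10).reshape(3, 4)

# divisible by 3 AND greater than 50
both = tens[(tens % 3 == 0) & (tens > 50)]

# less than 20 OR greater than 100
either = tens[(tens < 20) | (tens > 100)]

# NOT divisible by 3
not_div = tens[~(tens % 3 == 0)]

print(both)
print(either)
print(not_div)
```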

We can even use masks to generate new arrays with conditionals. `np.where()` takes as its arguments a boolean expression, an expression to evaluate if `True`, and an expression to evaluate otherwise. This is analogous to creating a new array based on an old one with a `for` loop and `if`/`else` statements, but much faster.

In [None]:
np.where(tens % 3 == 0,  # condition
         tens,  # return the element if True
         0)  # return 0 if False

In [None]:
result = []

for row in tens:
    result_row = []
    for col in row:
        if col % 3 == 0:
            result_row.append(col)
        else:
            result_row.append(0)
    result.append(result_row)

result
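
The two value arguments of `np.where()` do not have to come from the original array. As a sketch, the example below labels each element and caps large values; the threshold of 60, the cap of 50, and the label strings are arbitrary choices for illustration.

```python
import numpy as np

tens = np.arange(0, 120, 10).reshape(3, 4)

# label each element based on a threshold
labels = np.where(tens >= 60, "high", "low")

# cap values above 50 at 50 and keep the rest unchanged
capped = np.where(tens > 50, 50, tens)

print(labels)
print(capped)
```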

## Loading flat files to `numpy` arrays

We can also load data from files into `numpy` arrays. Recall the California housing CSV we loaded earlier:

In [None]:
with open('../05_data/slides_data/california_housing_test.csv', 'r') as f:
    # print the first five lines
    for i in range(5):
        # each line already ends with a newline, so suppress print's own
        print(f.readline(), end='')

We can load this data to a `numpy` array using `genfromtxt()`, which takes a path to a file as an argument. We should also include a `delimiter` argument, indicating how values are separated. Useful optional arguments include `names`, which we can use to tell `numpy` that the first row contains column names; `skip_header`, which we can use to skip the first few lines; and arguments for how to handle missing values.

There is also `loadtxt()`, which is simpler than `genfromtxt()` but does not have options for using header names or handling missing values.

To fill missing values with `genfromtxt()`, we need to set the `usemask` parameter to `True`. The result will be a masked array in that case, which we can then convert to a regular array with the `filled(np.nan)` method. Note that this only works for float columns.

In [None]:
hd = np.loadtxt('../05_data/slides_data/california_housing_test.csv',
                delimiter=',',
                skiprows=1)
hd.shape

In [None]:
housing_data = np.genfromtxt('../05_data/slides_data/california_housing_test.csv',
                             delimiter=',',
                             #skip_header=1,  # let's use names instead
                             names=True,
                             missing_values=500001.0,
                             usemask=True  # replace missing values
                            )

housing_data.filled(np.nan)

If you look closely, the array we got back is not a matrix. Each row looks like a tuple with comma-separated values. If we check the `shape`, we see 3000 rows and what looks like no columns. The `dtype` attribute lists all our field names.

In [None]:
housing_data.shape

In [None]:
housing_data.dtype

In this case, `genfromtxt` returned a _structured array_, a different type of array than the ones we have seen so far. We can refer to fields by putting the field name as a string in square brackets, similar to referencing a dictionary key. However, we will soon see a data type in another package, `pandas`, that is even better suited to working with columns in tabular data like this.

In [None]:
np.median(housing_data['housing_median_age'])

In [None]:
housing_data['median_income'].mean()
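
To experiment with structured arrays without the data file, we can pass `genfromtxt()` an in-memory file object instead of a path. The sketch below uses `io.StringIO` and a tiny made-up table: the two column names are borrowed from the housing data, but the three rows of values are invented and do not come from the real file.

```python
import io
import numpy as np

# a made-up CSV with a header row; the values are illustrative only
csv_text = """housing_median_age,median_income
10.0,1.0
20.0,2.0
30.0,3.0
"""

toy_data = np.genfromtxt(io.StringIO(csv_text),
                         delimiter=',',
                         names=True)

# one entry per row; the columns live in the dtype as named fields
print(toy_data.shape)
print(toy_data.dtype.names)

# refer to a field by name, like a dictionary key
print(np.median(toy_data['housing_median_age']))
print(toy_data['median_income'].mean())
```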

# References

- NumPy. _Basic Numpy_. https://numpy.org/devdocs/user/quickstart.html
- NumPy. _Numpy Routines_. https://numpy.org/doc/stable/reference/routines.html