# Creating and Indexing Arrays

We will start with the most fundamental task in NumPy: creating and working with an array. An **array** is a vector, matrix, or tensor of numbers. In plain English, it is a grid of numeric values on which we can efficiently perform numeric operations, for tasks such as statistics or machine learning.



## Why NumPy and Vectorization? 

In Python, you may be familiar with collections like lists. We could take numbers, such as integers or floating point values, and put them inside lists like this. 

In [None]:
x = [1.2, 5.4, 7.3]
y = [3.2, 5.1, 2.8]

print(f"x = {x}")
print(f"y = {y}")

Now let's say we wanted to add each of the respective elements of $ x $ and $ y $ together. If we were to use the Python add `+` operator, it would concatenate the two lists into one longer list. In a linear algebra/numeric computing context this is not what we want.

In [None]:
x = [1.2, 5.4, 7.3]
y = [3.2, 5.1, 2.8]

x + y 

To achieve this in plain Python, we would have to use a `for` loop or list comprehension with a `zip()` like below:  

In [None]:
x = [1.2, 5.4, 7.3]
y = [3.2, 5.1, 2.8]

[_x + _y for _x,_y in zip(x,y)]

Not to mention, **plain Python is REALLY slow at this kind of work**. Looping over numbers one at a time in the interpreter is not performant for numeric computing. Python's computational efficiency comes from low-level libraries like NumPy, which are largely written in C.

So what would this look like with NumPy using its `ndarray` type? Let's take a look. 

In [None]:
import numpy as np 

x = np.array([1.2, 5.4, 7.3])
y = np.array([3.2, 5.1, 2.8])

x + y 

Note how we bring in `numpy` and alias it as `np`, a common convention. We then declare two arrays using the `array()` function, passing a list of numeric values to each call. We can then add the two arrays together element by element using the `+` operator.

This **vectorized** approach to mathematical operations is far more efficient and convenient: it avoids a lot of `for` loops and leverages efficiencies built into NumPy. More specifically, NumPy is optimized to store lists or grids of numbers and to perform mathematical operations between them. With data full of numeric values, we can take advantage of the fact that CPUs and GPUs can do math more efficiently on multiple values at once. Therefore, vectorization is practically a requirement for tasks like machine learning.
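As a minimal sketch of what vectorization covers beyond `+`, the cell below applies a few other element-wise operations to the same `x` and `y` arrays. The variable names `scaled`, `product`, and `total` are only illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

x = np.array([1.2, 5.4, 7.3])
y = np.array([3.2, 5.1, 2.8])

# Multiply every element by a scalar, no loop needed
scaled = x * 2

# Multiply the two arrays element by element (this is not a dot product)
product = x * y

# Reduce an array to a single value
total = x.sum()

print(f"scaled  = {scaled}")
print(f"product = {product}")
print(f"total   = {total}")
```

Each of these would need a loop or a comprehension with plain Python lists.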

You may hear a list of numbers referred to as a **vector**, a two-dimensional grid of numbers referred to as a **matrix**, and a grid with more dimensions referred to as a **tensor**. Below is an example of a vector $ \vec{v} $ and a matrix $ A $.

$$
\Large \vec{v} = \begin{bmatrix} \Large 3 \\ \Large 2 \\ \Large 7 \end{bmatrix}
$$

$$
\Large A = \begin{bmatrix} \Large -1 & \Large 1  \\ \Large 0.5 & \Large -2 \end{bmatrix}
$$


> Linear algebra is a topic in itself, and you should [consider taking the Anaconda course on Linear Algebra](https://learning.anaconda.cloud/linear-algebra). 

## Declaring an Array 

Let's dive into the array more, or more specifically the `ndarray`. This is probably the most fundamental data type in NumPy. As we saw earlier, we can declare it using a simple numeric list passed to the `array()` function. 

In [None]:
x = np.array([6, 1, 17, 3, 0, 3]) 
x

In [None]:
type(x)

That is a 1-dimensional array. An array can have 0 dimensions, meaning it is just a single scalar value. 

In [None]:
np.array(5)

We can also make a 2-dimensional array, meaning we have an array consisting of rows and columns. You can do this by nesting lists `[]` inside a list `[[]]`. 

In [None]:
x = np.array([[6, 1, 17], 
              [3, 0, 3]]) 
x

A common way to inspect an array is to check its `shape`. We can see below that this one has 2 rows and 3 columns.

In [None]:
x.shape

We can also check the `dtype` of the array, which is the type of numeric values it is holding.

In [None]:
x.dtype

The datatype of the array is inferred from the list of numbers you provide, but you can also be explicit. If we wanted to declare an array of integers and force them to be `float32`, we can do that. 

In [None]:
x = np.array([[6, 1, 17], 
              [3, 0, 3]], dtype='float32') 
x
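
To illustrate the inference described above, here is a small sketch. It assumes the `astype()` method, which is not used elsewhere in this section, as one way to convert an existing array to another datatype; the names `mixed` and `as_float32` are illustrative.

```python
import numpy as np

# A single float in the list makes NumPy infer a floating point dtype
mixed = np.array([6, 1, 17.5])
print(mixed.dtype)        # float64

# astype() returns a converted copy; the original array is unchanged
as_float32 = mixed.astype('float32')
print(as_float32.dtype)   # float32
print(mixed.dtype)        # still float64
```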

NumPy [supports many datatypes](https://numpy.org/doc/stable/reference/arrays.dtypes.html) including dates, times, and arbitrary data. Generally when starting out, you will work with floats and integers. The `8`, `16`, `32`, or `64` next to the type is the number of bits used to store each value, which determines how large a range the number can cover (at the cost of more memory). An `int16` can hold any integer from $ -32768 $ through $ 32767 $, while an `int32` can hold $ -2,147,483,648 $ to $ 2,147,483,647 $ and an `int64` reaches roughly $ \pm 9.2 \times 10^{18} $.

Integers also have unsigned counterparts (meaning they can only be 0 or more, no negatives) with `uint8`, `uint16`, `uint32`, etc. 
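Rather than memorizing these ranges, you can ask NumPy for them. The sketch below uses `np.iinfo`, which reports the limits of an integer type; the `ranges` dictionary is just an illustrative way to collect the results.

```python
import numpy as np

# Collect the (min, max) limits for a few signed and unsigned integer types
ranges = {}
for int_type in (np.int8, np.int16, np.int32, np.int64, np.uint8, np.uint16):
    info = np.iinfo(int_type)
    ranges[int_type.__name__] = (info.min, info.max)

for name, (low, high) in ranges.items():
    print(f"{name}: {low} to {high}")
```

Note that the unsigned types start at 0 and reach about twice as high as their signed counterparts of the same size.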

## Declaring Higher-Dimensional Arrays

You can get really crazy, declaring higher-dimensional tensors where you have stacks and stacks of numeric grids representing images and video data. Below we have a 3x3 pixel image stored as a tensor, where the red, green, and blue channels are stored as three sub-layers. The point is... you can get crazy with how you ingest data and store it in higher-dimensional numeric patterns.

In [None]:
my_image = np.array([
    [[0, 1, 3],
     [6, 2, 6], 
     [1, 5, 4]], 
    [[8, 3, 19],
     [33, 34, 11], 
     [13, 14, 89]], 
    [[14, 68, 17],
     [66, 84, 92], 
     [4, 2, 58]]
])

my_image

We can get the number of dimensions for a given array using the `ndim` property. 

In [None]:
my_image.ndim

We can also get the `shape` of the array. 

In [None]:
my_image.shape

We can also ask for a minimum number of dimensions using the `ndmin` argument. NumPy keeps the data as provided and wraps it in extra dimensions of length 1 as needed. Below, we take two rows of data but specify that they be nested in 5 dimensions.

In [None]:
np.array([[1, 2, 3],[4,5,6]], ndmin=5)
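
To see what `ndmin` did, it helps to check `ndim` and `shape` on the result. This is a small sketch; the name `padded` is illustrative.

```python
import numpy as np

padded = np.array([[1, 2, 3], [4, 5, 6]], ndmin=5)

print(padded.ndim)    # 5
print(padded.shape)   # (1, 1, 1, 2, 3)
```

The original 2 rows and 3 columns are still there as the last two dimensions, with three dimensions of length 1 added in front.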


## Common Methods to Create an Array

We will learn later how to create arrays from external data (like CSVs or SQL databases, for example). But here are a few other notable ways to create an array.

It is common to initialize an array of just 0's.

In [None]:
np.zeros(shape=(3,2), dtype='int8')
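
Along the same lines, NumPy has `np.ones` and `np.full` for arrays filled with 1's or with any constant. These two functions are not covered elsewhere in this section, so treat the cell below as a brief sketch; the names `ones` and `sevens` are illustrative.

```python
import numpy as np

# An array of 1's (floats by default when no dtype is given)
ones = np.ones(shape=(3, 2))

# An array filled with a constant of our choosing
sevens = np.full(shape=(3, 2), fill_value=7)

print(ones)
print(sevens)
```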

We can also generate an array with random numbers, which is common for creating simulations or initializing weight values in a neural network. We will learn more about random number generation later.

In [None]:
np.random.randint(6, size=(3,2))

Another common function is `arange`, which will generate incremental values (with the `start` inclusive and the `stop` exclusive) spaced by a specified `step` (which defaults to `1`).

In [None]:
np.arange(start=0, stop=10, step=2)

Similar to `arange` is the `linspace` function. Rather than specifying a `step`, you specify the number of evenly spaced values to generate across an interval (with the `endpoint` included by default). Below we get 40 evenly spaced points between `0` and `10`, but we omit the `10` to make the end exclusive.

In [None]:
np.linspace(start=0, stop=10, num=40, endpoint=False)

You will find `arange` and `linspace` are used a lot for plotting functions in `matplotlib`. There are many examples in the Anaconda course [Statistics and Hypothesis Testing](https://learning.anaconda.cloud/statistics-and-hypothesis-testing).
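To make the difference between the two concrete, here is a small side-by-side sketch over the same interval. The variable names are illustrative.

```python
import numpy as np

# arange: we choose the step, NumPy works out how many values fit
by_step = np.arange(start=0, stop=10, step=2)        # values 0, 2, 4, 6, 8

# linspace: we choose how many values, NumPy works out the step
by_count = np.linspace(start=0, stop=10, num=5)      # values 0, 2.5, 5, 7.5, 10

# Excluding the endpoint changes the spacing
by_count_open = np.linspace(start=0, stop=10, num=5, endpoint=False)  # values 0, 2, 4, 6, 8

print(by_step)
print(by_count)
print(by_count_open)
```

Note that `arange` with integer arguments returns integers, while `linspace` returns floats.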

## Differences from pandas 

What makes NumPy different from pandas? It is first important to establish that pandas and NumPy are not in competition. As a matter of fact, you will find it common to import a pandas `DataFrame` and turn it into a NumPy `ndarray`, which we will explore in a later section. They are just different tools used for different problems. Generally, when you are working with a lot of tabular data (that is, two-dimensional data with simply rows and columns) of different datatypes, pandas is going to be your best bet for that task. Think spreadsheet tasks, and that is where pandas will excel (no pun intended).

NumPy is going to be preferred when you are dealing with numerical data in higher dimensions, and you start to get intensive on the mathematical operations. Think importing images where each image has 3 layers (the red, green, and blue channels) and you have a whole stack of images representing frames in videos. This is definitely not something pandas is streamlined for, while NumPy will do this task more effectively. 

That being said, you can take any pandas `DataFrame` (or certain columns of it) and turn it into a NumPy `ndarray` using the `values` property. We will see this more in action later.

In [None]:
import pandas as pd
import numpy as np 

df = pd.DataFrame({'x': [1, 2], 'y' : [3,4]})
df

In [None]:
my_array = df.values
my_array

In [None]:
type(my_array)
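
To convert only certain columns, select them before converting. The sketch below uses `to_numpy()`, which the pandas documentation recommends over `values`; the names `x_only` and `as_2d` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# A single column becomes a 1-dimensional array
x_only = df['x'].to_numpy()

# A list of columns keeps the result 2-dimensional
as_2d = df[['x', 'y']].to_numpy()

print(x_only.shape)   # (2,)
print(as_2d.shape)    # (2, 2)
```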

## Exercise

Create this matrix of numbers using a NumPy array, and put it in the cell below. 

$$
\Large A = \begin{bmatrix} \Large 7 & \Large 1 & \Large -3 \\ \Large 2 & \Large -2 & \Large 31 \end{bmatrix}
$$

In [None]:
# Put your code here 





### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

In [None]:
import numpy as np 

A = np.array([
    [7, 1, -3], 
    [2, -2, 31]
])
A 