# Arrays

The Python types we have seen so far are all built into Python; this means that they are automatically available when you start writing Python code. As with any modern language, however, programmers are always adding new functionality to Python, and they often want to share their code with others. The main way to do this is through packages.

A package contains extra object types, functions, and variables that are not included in base Python. Some packages are specialized to niche applications, while others are so popular that nearly every coder uses them.

One particular package, NumPy, is the foundation for scientific and data computing in Python. Many other data science packages are built on top of NumPy, so it is very important to understand how it works.

[NumPy](http://www.numpy.org/) is the core of the [PyData](http://pydata.org/) ecosystem.

------

*From the NumPy Website*

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

------

We often use NumPy to work with data. In this notebook, we will begin introducing its functionality with a few simple examples.

To use the functionality provided by a package, we first have to import it into our program with the `import` keyword. When we import a package, we can also give it a shorter alias with the `as` keyword. By convention in the PyData community, data packages are imported under short names, and `numpy` is imported as `np`.


In [1]:
import numpy as np

In [2]:
np.arange(0,20)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

This should look pretty familiar; the `arange` function is analogous to the `range` function we saw earlier. Instead of returning a range, however, this function returns something we call an *array*.

Arrays are similar to the sequences we saw earlier. For example, we can access items by their indices.

In [3]:
my_array = np.arange(20)
my_array[2] = 12
my_array

array([ 0,  1, 12,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
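As a minimal sketch of how array indexing mirrors sequence indexing, the cell below uses negative indices and a slice. The variable names (`values`, `first`, `last`, `middle`) are illustrative only and do not appear elsewhere in this lesson.

```python
import numpy as np

values = np.arange(10)      # the integers 0 through 9

first = values[0]           # first element
last = values[-1]           # negative indices count from the end
middle = values[2:5]        # slice: the elements at indices 2, 3, and 4

print(first, last, middle)  # 0 9 [2 3 4]
```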

Arrays also have some unique characteristics.  For one thing, all the objects contained in an array must be the same type.  This may seem like a pesky restriction, but it allows arrays to work very efficiently on large amounts of data.

In [4]:
my_array[5] = "shoe"

ValueError: invalid literal for int() with base 10: 'shoe'

The data type of an array can be accessed through the `dtype` attribute, which describes every object stored in the array. Because the type belongs to the array as a whole, you cannot change it once the array is created (unless you work around this, for example by creating a converted copy).

In [5]:
my_array.dtype

dtype('int64')
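One possible workaround, sketched here under the assumption that a converted copy is acceptable, is the `astype` method. The example also shows that a float assigned into an integer array is silently truncated rather than rejected. The names `int_array` and `float_array` are illustrative, and the exact integer width reported by `dtype` can differ between platforms.

```python
import numpy as np

int_array = np.arange(5)        # integer dtype (exact width is platform-dependent)
int_array[0] = 7.9              # the float is truncated to fit the integer dtype
print(int_array[0])             # 7

# astype returns a new array with the requested type; int_array is unchanged
float_array = int_array.astype(float)
float_array[0] = 7.9
print(float_array[0])           # 7.9
print(float_array.dtype)        # float64
```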

In addition, once an array is created, its size is fixed. You cannot pop or append to an array because of the way that arrays allocate memory. The fixed size allows arrays to be optimized for high performance.

In [6]:
new_array = np.array([5,10,15,20])

In [7]:
new_array.append(5)

AttributeError: 'numpy.ndarray' object has no attribute 'append'

In [8]:
new_array.pop()

AttributeError: 'numpy.ndarray' object has no attribute 'pop'
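If you do need a longer or shorter array, one option is to build a new one. The sketch below uses the function `np.append`, which returns a new array instead of modifying the original, and a slice to drop the last element. The names `longer` and `shorter` are illustrative.

```python
import numpy as np

new_array = np.array([5, 10, 15, 20])

longer = np.append(new_array, 5)   # builds and returns a new, larger array
shorter = new_array[:-1]           # everything except the last element

print(new_array)   # [ 5 10 15 20]  (the original is unchanged)
print(longer)      # [ 5 10 15 20  5]
print(shorter)     # [ 5 10 15]
```

Because each call to `np.append` allocates a new array, calling it repeatedly in a loop is usually slower than collecting values in a list and converting once at the end.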

You can see in the error messages that the arrays we are dealing with have type `'numpy.ndarray'`. This stands for *n*-dimensional array. We can make *n* anything that we want.

For example, we can create a two-dimensional matrix with uniformly distributed random numbers between 0 and 1.

In [9]:
np.random.rand(5,5)

array([[ 0.10648234,  0.83217679,  0.63806576,  0.69003448,  0.9748543 ],
       [ 0.59020448,  0.86675064,  0.80940695,  0.88256189,  0.99050187],
       [ 0.87244291,  0.07258846,  0.57155022,  0.44507565,  0.045058  ],
       [ 0.93489746,  0.09116535,  0.59777247,  0.09772608,  0.7570062 ],
       [ 0.03492041,  0.14402342,  0.89550503,  0.51289362,  0.68879071]])

Or we can create a three-dimensional cube of numbers.

In [10]:
np.random.rand(5,5,5)

array([[[ 0.62190203,  0.67813165,  0.24879432,  0.85351031,  0.47839552],
        [ 0.12885694,  0.85591285,  0.7676117 ,  0.18049391,  0.24903808],
        [ 0.33438681,  0.00883342,  0.30958822,  0.05377948,  0.09619761],
        [ 0.63619928,  0.47561247,  0.86378828,  0.01852412,  0.03134734],
        [ 0.31974582,  0.84927553,  0.02526485,  0.66531965,  0.61367574]],

       [[ 0.43447214,  0.25742403,  0.56906217,  0.72603296,  0.21547998],
        [ 0.08282102,  0.88334032,  0.11279749,  0.23502522,  0.98089176],
        [ 0.67671649,  0.12275952,  0.54825469,  0.39615149,  0.44027412],
        [ 0.34341265,  0.27453947,  0.50699397,  0.02295559,  0.51015296],
        [ 0.86374756,  0.82933671,  0.97860856,  0.16388404,  0.18676941]],

       [[ 0.50137612,  0.86100886,  0.46771107,  0.49762319,  0.09942061],
        [ 0.64682201,  0.80645434,  0.30591674,  0.78372676,  0.62082823],
        [ 0.0725108 ,  0.43652937,  0.01882027,  0.04576894,  0.03794534],
        [ 0.8221575 ,

We can keep going to higher and higher dimensional spaces, but doing so is outside the scope of this course.
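To check how many dimensions an array has without printing all of its contents, you can inspect its `ndim`, `shape`, and `size` attributes. The following is a small sketch; the four-dimensional shape `(2, 3, 4, 5)` is an arbitrary choice for illustration.

```python
import numpy as np

matrix = np.random.rand(5, 5)         # 2 dimensions
cube = np.random.rand(5, 5, 5)        # 3 dimensions
hyper = np.random.rand(2, 3, 4, 5)    # 4 dimensions (arbitrary example shape)

for name, arr in [("matrix", matrix), ("cube", cube), ("hyper", hyper)]:
    # ndim: number of dimensions, shape: length along each one, size: total elements
    print(name, arr.ndim, arr.shape, arr.size)
```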

The random numbers we generated above were drawn from a uniform probability distribution. This is a very simple distribution: its density is equal to 1 between 0 and 1, and 0 everywhere else. NumPy gives us many other distributions to work with, however.

In [11]:
np.random?

We can see that there are a variety of functions at our disposal. Let's get some normally distributed random variables.

In [12]:
np.random.randn(20)

array([ 0.92269111, -2.2276769 , -1.28704041,  0.89730048, -1.33606677,
       -0.65326701,  0.7493691 , -1.29093916, -0.65151334, -2.03886176,
        0.11773115, -0.37387359,  0.11986561,  1.96490916,  0.21022098,
        1.62917658, -0.80917623, -0.23805272, -0.92187318,  1.31539243])
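To see the difference between the two distributions, the sketch below draws a large sample from each and compares simple summaries. The call to `np.random.seed` is an addition not used elsewhere in this lesson; it only makes the run repeatable. The sample size of 100,000 is an arbitrary choice.

```python
import numpy as np

np.random.seed(0)                     # fix the seed so the run is repeatable

uniform = np.random.rand(100_000)     # uniform on the interval [0, 1)
normal = np.random.randn(100_000)     # standard normal: mean 0, standard deviation 1

# Uniform samples never leave [0, 1); their mean is close to 0.5
print(uniform.min() >= 0.0, uniform.max() < 1.0)   # True True
print(uniform.mean())

# Normal samples can be negative or greater than 1; mean near 0, std near 1
print(normal.min(), normal.max())
print(normal.mean(), normal.std())
```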

Now that we have this array of normal random variables, let's reshape it into a matrix. This is done with the `reshape` method.

In [13]:
ar = np.random.randn(20).reshape(5,4)

In [14]:
ar

array([[ 0.77400258, -0.33902124,  0.84960228,  0.67791938],
       [ 1.51416211,  0.88888946, -1.18039892,  0.7369464 ],
       [-1.18751671, -0.15008941, -0.02608664,  0.02503204],
       [ 0.87306404, -0.26468326, -1.78996207,  0.41580606],
       [ 0.14680233,  0.53553181, -0.34289575,  1.11064037]])
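Random values make it hard to see where each element ends up, so here is a small sketch of `reshape` on the integers 0 through 19. It also shows that passing `-1` for one dimension lets NumPy infer it from the total size. The names `flat`, `grid`, and `auto` are illustrative.

```python
import numpy as np

flat = np.arange(20)

grid = flat.reshape(5, 4)     # 5 rows, 4 columns, filled row by row
auto = flat.reshape(-1, 10)   # -1 asks NumPy to infer that dimension (2 here)

print(grid[0])      # [0 1 2 3]
print(grid.shape)   # (5, 4)
print(auto.shape)   # (2, 10)
```

The new shape must account for exactly the same number of elements; a call such as `flat.reshape(3, 7)` raises a `ValueError` because 3 × 7 is not 20.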

When you are working with data, you will often need to calculate some statistics from it. NumPy arrays have many common statistical functions built in as methods. To see an example, let's get the mean of our array.

In [15]:
ar.mean()

0.1633872434123195

The standard deviation is just as easy.

In [16]:
ar.std()

0.82714511473596986
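A few other summary methods work the same way. The sketch below uses a small hand-written array (the name `data` is illustrative) so that the results are easy to verify by hand. Note that `std` and `var` compute the population versions by default, dividing by the number of elements.

```python
import numpy as np

data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

print(data.sum())    # 21.0
print(data.min())    # 1.0
print(data.max())    # 6.0
print(data.mean())   # 3.5
print(data.var())    # population variance; data.std() is its square root
```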

Sometimes we will want to do these operations on each column or on each row. Doing so is still easy.

Here are the means of each column.

In [17]:
ar.mean(axis=0)

array([ 0.42410287,  0.13412547, -0.49794822,  0.59326885])

Here are the means of each row.

In [18]:
ar.mean(axis=1)

array([ 0.49062575,  0.48989976, -0.33466518, -0.19144381,  0.36251969])
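The `axis` argument names the dimension that gets collapsed, which can be easier to see on a small array with known values. This is a minimal sketch; the names `table`, `col_means`, and `row_means` are illustrative.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])       # shape (2, 3): 2 rows, 3 columns

col_means = table.mean(axis=0)      # collapse the rows: one mean per column
row_means = table.mean(axis=1)      # collapse the columns: one mean per row

print(col_means)   # [2.5 3.5 4.5]
print(row_means)   # [2. 5.]
```

The result along `axis=0` has one entry per column (3 here), and the result along `axis=1` has one entry per row (2 here).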

This lesson has only begun to scratch the surface of NumPy's capabilities.  The NumPy library is huge, and no one class could truly cover all of it. Additionally, NumPy has a partner library, SciPy (which stands for *scientific Python*), that builds on NumPy and adds a great deal of advanced functionality. We will return to NumPy later in this course. You will learn more about SciPy as you continue through the data science program.