# Arrays and Vectors

The Python types we've seen so far are all *built-in* to Python.  This means that they're automatically available when you start writing Python code.  As with any modern language, however, programmers are always adding new functionality to Python and they often want to share their code with others.  The main way this is done is with packages.

A package contains extra object types, functions, and variables that are not included in base Python.  Some packages are specialized to niche applications, while others are so popular that nearly every coder uses them.

One package in particular, NumPy, is the foundation for scientific and data computing in Python. Many other data science packages are built on top of NumPy, so it's very important to understand how it works.

[NumPy](http://www.numpy.org/) is the core of the [pydata](http://pydata.org/) ecosystem.

------

*From the NumPy website*

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities

------

We'll often use numpy to work with data.  In this notebook, we'll begin introducing its functionality with a few simple examples.

To use the functionality provided by a package, we first have to import it into our program.  This is done with the `import` keyword.  When we import a package, we can also give it a shorter alias with the `as` keyword. By convention in the pydata community, data packages are imported under short names, and numpy is traditionally imported as `np`.


In [1]:
import numpy as np

In [2]:
np.arange(0,20)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

Now this should look pretty familiar; the arange function is analogous to the range function we saw earlier.  Instead of returning a range, however, this function returns something we call an array.
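To make the comparison concrete, here is a minimal sketch that puts `range` and `np.arange` side by side. The variable names (`r`, `a`, `halves`) are only for illustration.

```python
import numpy as np

# range produces a lazy sequence of Python ints; arange produces a NumPy array.
r = range(0, 20, 5)
a = np.arange(0, 20, 5)      # start, stop (exclusive), step

print(list(r))               # [0, 5, 10, 15]
print(a.tolist())            # [0, 5, 10, 15]
print(type(a).__name__)      # ndarray

# Unlike range, arange also accepts a non-integer step.
halves = np.arange(0, 2, 0.5)
print(halves.tolist())       # [0.0, 0.5, 1.0, 1.5]
```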

Arrays are similar to the sequences we saw earlier.  For example, we can read and assign items by their indices.

In [8]:
my_array = np.arange(20)
my_array[2] = 12
my_array

array([ 0,  1, 12,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])
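Slicing works on arrays much as it does on lists. The following is a small sketch on a fresh array; the names `first_five`, `last_item`, and `every_fourth` are illustrative.

```python
import numpy as np

my_array = np.arange(20)

first_five = my_array[:5]       # elements at indices 0 through 4
last_item = my_array[-1]        # negative indices count from the end
every_fourth = my_array[::4]    # whole array in steps of 4

print(first_five.tolist())      # [0, 1, 2, 3, 4]
print(int(last_item))           # 19
print(every_fourth.tolist())    # [0, 4, 8, 12, 16]
```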

Arrays also have some unique characteristics.  For one thing, all the objects contained in an array must be the same type.  This may seem like a pesky restriction, but it allows arrays to work very efficiently on large amounts of data.

In [9]:
my_array[5] = "shoe"

ValueError: invalid literal for int() with base 10: 'shoe'
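One payoff of keeping every element the same type is that an operation can be applied to the whole array in a single expression. The sketch below compares that with the list equivalent; it illustrates the style of computation and is not a benchmark.

```python
import numpy as np

values = np.arange(5)

# One expression applies the operation to every element at once.
doubled = values * 2
squared = values ** 2

# The equivalent with a plain list needs an explicit loop or comprehension.
doubled_list = [v * 2 for v in range(5)]

print(doubled.tolist())   # [0, 2, 4, 6, 8]
print(squared.tolist())   # [0, 1, 4, 9, 16]
print(doubled_list)       # [0, 2, 4, 6, 8]
```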

The data type of an array can be accessed through its `dtype` property. Because every element shares the same type, this one value describes all the objects stored in the array. It also means that you can't change the type of an array once it's created (unless you work around this, for example by creating a new array with a different type).

In [10]:
my_array.dtype

dtype('int64')
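As a sketch of what the fixed type means in practice, the example below assigns a float into an integer array and then uses the `astype` method, which returns a new array rather than changing the original. The exact integer type printed first depends on the platform.

```python
import numpy as np

ints = np.arange(5)
print(ints.dtype)             # a platform-dependent integer type, e.g. int64

# Assigning a float into an integer array truncates it to fit the dtype.
ints[0] = 3.7
print(ints[0])                # 3

# astype returns a new array; the original keeps its dtype.
floats = ints.astype(float)
floats[0] = 3.7
print(floats.dtype)           # float64
print(floats[0])              # 3.7
```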

Another thing about arrays is that once they're created, their size is fixed. You can't pop or append to an array in place. That's because an array's memory is allocated as a block when the array is created. The fixed size allows arrays to be optimized for high performance.

In [11]:
new_array = np.array([5,10,15,20])

In [12]:
new_array.append(5)

AttributeError: 'numpy.ndarray' object has no attribute 'append'

In [13]:
new_array.pop()

AttributeError: 'numpy.ndarray' object has no attribute 'pop'
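When a longer or shorter array is really needed, the usual approach is to build a new one. A minimal sketch using the `np.append` function and a slice (the names `longer` and `shorter` are illustrative):

```python
import numpy as np

new_array = np.array([5, 10, 15, 20])

# np.append does not modify new_array; it allocates and returns a new array.
longer = np.append(new_array, 5)

# Dropping the last element can be expressed as a slice.
shorter = new_array[:-1]

print(new_array.tolist())   # [5, 10, 15, 20]
print(longer.tolist())      # [5, 10, 15, 20, 5]
print(shorter.tolist())     # [5, 10, 15]
```

Because each call to `np.append` copies the data, growing an array one element at a time in a loop is slow; it is usually better to collect values in a list and convert once.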

You can see in the error messages that the arrays we're dealing with have type `'numpy.ndarray'`. This stands for n-dimensional array.  We can make n anything that we want.

For example, we can create a 2-dimensional matrix with uniformly distributed random numbers between 0 and 1.

In [14]:
np.random.rand(5,5)

array([[ 0.3485017 ,  0.17995236,  0.3595112 ,  0.22129269,  0.70095449],
       [ 0.24573702,  0.25872806,  0.42839541,  0.71419389,  0.54417784],
       [ 0.79930945,  0.86196549,  0.85641477,  0.61529105,  0.89150267],
       [ 0.6562489 ,  0.1239075 ,  0.43037002,  0.59389352,  0.34547663],
       [ 0.28295824,  0.84908176,  0.79929516,  0.48377836,  0.32917841]])

Or a 3-dimensional cube of numbers.

In [26]:
np.random.rand(5,5,5)

array([[[ 0.84272158,  0.83679704,  0.5257461 ,  0.73736653,  0.45110009],
        [ 0.13841672,  0.10757949,  0.63145918,  0.16771083,  0.64128712],
        [ 0.80698612,  0.78004934,  0.78796473,  0.47922827,  0.70641219],
        [ 0.43528213,  0.65893285,  0.67785015,  0.64606294,  0.3998132 ],
        [ 0.65588931,  0.65056016,  0.53237286,  0.18387084,  0.3339849 ]],

       [[ 0.5422109 ,  0.44219767,  0.39250136,  0.47459834,  0.69258357],
        [ 0.91240791,  0.96729439,  0.14329598,  0.06610903,  0.16365455],
        [ 0.27483066,  0.84626304,  0.71325868,  0.82706608,  0.43683008],
        [ 0.30363783,  0.93689354,  0.72157754,  0.08772284,  0.57431791],
        [ 0.77952137,  0.60743034,  0.98645455,  0.33540844,  0.53469506]],

       [[ 0.89680185,  0.15110855,  0.43264088,  0.74135463,  0.14078825],
        [ 0.34220042,  0.53011661,  0.90455791,  0.25400812,  0.01920023],
        [ 0.93290593,  0.18418512,  0.43427611,  0.9612554 ,  0.45270675],
        [ 0.61656425,

We can keep going to higher and higher dimensional spaces, but doing so is outside the scope of this course.
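Rather than reading the dimensions off the printed output, we can ask an array directly. The sketch below uses the `ndim`, `shape`, and `size` attributes on arrays like the ones above; the variable names are illustrative.

```python
import numpy as np

matrix = np.random.rand(5, 5)
cube = np.random.rand(5, 5, 5)

print(matrix.ndim, matrix.shape)   # 2 (5, 5)
print(cube.ndim, cube.shape)       # 3 (5, 5, 5)
print(cube.size)                   # 125 elements in total

# Indexing takes one index per dimension.
corner = cube[0, 0, 0]
print(0 <= corner < 1)             # True
```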

The random numbers we generated above were drawn from a uniform probability distribution.  This is a very simple distribution: its density is equal to 1 between 0 and 1 and 0 everywhere else.  NumPy gives us many other distributions to work with, however.

In [4]:
np.random?

We can see that there are a variety of functions at our disposal. Let's get a bunch of normally distributed random variables.

In [27]:
np.random.randn(20)

array([-0.37930757,  1.37538244,  0.41835287,  0.07467508,  0.07084404,
        0.69637896, -1.11407639,  1.3812167 ,  0.22986012,  0.70550322,
        1.32124292,  0.47818417,  1.00462055, -2.64293399,  0.61211059,
        0.07565507, -1.12787497,  1.15320694, -2.24614077, -0.13124562])
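To see the difference between the distributions, the sketch below draws a large sample from each and checks its summary statistics. `rand` draws from the uniform distribution on [0, 1), `randn` from the standard normal (mean 0, standard deviation 1), and `normal` from a normal with a chosen mean and spread. The seed, the sample size, and the parameters 10 and 2 are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)   # fixed seed so the sketch is repeatable; the value is arbitrary

n = 100_000
uniform_draws = np.random.rand(n)                      # uniform on [0, 1)
normal_draws = np.random.randn(n)                      # standard normal
shifted = np.random.normal(loc=10, scale=2, size=n)    # normal, mean 10, std 2

# Uniform draws stay inside [0, 1); normal draws are unbounded.
print(uniform_draws.min() >= 0, uniform_draws.max() < 1)

# Sample statistics should land close to the distribution parameters.
print(round(normal_draws.mean(), 1), round(normal_draws.std(), 1))
print(round(shifted.mean(), 1), round(shifted.std(), 1))
```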

Now that we have this array of normal random variables, let's reshape it into a matrix. This is done with the reshape method.

In [6]:
ar = np.random.randn(20).reshape(5,4)

In [7]:
ar

array([[-0.62452718,  0.46052178,  0.40361749, -1.05052864],
       [ 0.3509878 , -0.65577956, -1.24891939,  2.68896709],
       [-1.28241532,  1.01829396,  1.07559661, -0.12953369],
       [ 0.6298654 ,  1.53532744,  0.22540621, -0.18991791],
       [ 0.10894515,  1.5719786 , -0.32377154,  1.09771267]])
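A reshape only works when the new shape holds exactly the same number of elements. The sketch below shows a valid reshape, the `-1` shorthand that asks NumPy to infer one dimension, and the error raised on a mismatch. The array `flat` is an illustrative stand-in for the random data above.

```python
import numpy as np

flat = np.arange(20)

grid = flat.reshape(5, 4)      # 5 rows, 4 columns
auto = flat.reshape(-1, 10)    # -1 asks NumPy to infer that dimension (here 2)

print(grid.shape)              # (5, 4)
print(auto.shape)              # (2, 10)

# 20 elements cannot fill a 3 x 7 grid, so reshape raises ValueError.
try:
    flat.reshape(3, 7)
except ValueError as err:
    print("reshape failed:", err)
```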

Now when you're working with data, you'll often need to calculate some statistics from it. What's awesome about numpy arrays is that they have a lot of common statistical functions baked in.  For example, let's get the mean of our array.

In [8]:
ar.mean()

0.28309134886647885

The standard deviation is just as easy.

In [41]:
ar.std()

0.96331100760399779

At times, we're going to want to do these operations on each column or on each row. Doing so is still easy.

Here are the means of each column.

In [35]:
ar.mean(axis=0)

array([-0.08367184,  0.38935699, -0.39893097,  0.43886695])

Here are the means of each row.

In [42]:
ar.mean(axis=1)

array([ 0.60553449, -0.39209588,  0.34833868,  0.63370937, -0.1756636 ])
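The `axis` argument is easier to check by hand on a tiny array. In this sketch, `axis=0` collapses the rows and leaves one value per column, while `axis=1` collapses the columns and leaves one value per row. The array `small` is made up for illustration.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(small.mean())                   # 3.5
print(small.mean(axis=0).tolist())    # [2.5, 3.5, 4.5]  one value per column
print(small.mean(axis=1).tolist())    # [2.0, 5.0]       one value per row

# Other reductions accept the same argument.
print(small.sum(axis=0).tolist())     # [5, 7, 9]
print(small.max(axis=1).tolist())     # [3, 6]
```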

This lesson has only begun to scratch the surface of NumPy's capabilities.  The NumPy library is huge and no one class could truly cover all of it.  Additionally, NumPy has a partner library, SciPy (which stands for Scientific Python), that builds on NumPy and adds a great deal of advanced functionality. We'll return to NumPy later in this course.  You'll learn more about SciPy as you continue through the data science program.