# Introduction to Numerical Python (NumPy)

### SciPy
**SciPy** (pronounced “Sigh Pie”) is a set of open-source Python libraries specialized for scientific computing. Many of these libraries are critical to data processing. Together they constitute a set of tools for calculating and displaying data.
Among the libraries that are part of the SciPy group, there are some in particular that will be discussed in the following lessons:
 - NumPy
 - Matplotlib
 - Pandas

**NumPy**<br>
This library, whose name stands for Numerical Python, forms the core of many other Python libraries that have grown out of it. NumPy is the foundation library for scientific computing in Python because it provides data structures and high-performance functions that the Python standard library does not. In particular, NumPy defines an N-dimensional array data structure called ndarray.

Knowing this library well is essential for numerical work, since using it correctly can greatly influence the performance of a computation.
This package adds several features to standard Python:
 - ndarray: a multidimensional array much faster and more efficient than the sequences provided by the Python standard library.
 - element-wise computation: a set of functions for performing this type of calculation on arrays, and mathematical operations between arrays.
 - reading-writing data sets: a set of tools for reading and writing data stored on disk.
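
As a minimal sketch of what element-wise computation means, the following compares an arithmetic operation on arrays with the equivalent loop over plain lists. The variable names and sample values are made up for illustration.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration
prices = np.array([10.0, 20.0, 30.0])
quantities = np.array([1, 2, 3])

totals = prices * quantities   # element-wise product, no explicit loop
discounted = totals * 0.9      # the scalar is applied to every element

# The same multiplication with plain lists needs an explicit loop
totals_list = [p * q for p, q in zip([10.0, 20.0, 30.0], [1, 2, 3])]

print(totals)        # [10. 40. 90.]
print(totals_list)   # [10.0, 40.0, 90.0]
```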


**NumPy: A Little History**<br>
In the early days of Python, developers soon needed to perform numerical calculations, especially once the scientific community began to take an interest in the language.<br>
The first attempt was Numeric, developed by Jim Hugunin in 1995, which was later followed by an alternative package called Numarray. Both packages specialized in array computation, and each had strengths in different cases, so users chose between them depending on which was more efficient for the task at hand. This ambiguity led to the idea of unifying the two packages, and Travis Oliphant started to develop the NumPy library. Its first release (v 1.0) occurred in 2006.
Since then, NumPy has established itself as the Python extension library for scientific computing, and it is currently the most widely used package for computation on multidimensional and large arrays. The package also comes with a range of functions for performing operations on arrays efficiently and for high-level mathematical calculations.<br>
NumPy is open source and licensed under BSD (Berkeley Software Distribution). Many contributors have expanded the potential of this library with their support.

**Ndarray: The Heart of the Library**<br>
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. A NumPy array is a grid of values, all of the same type, indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.<br>

The Python core language provides lists. A list is the Python equivalent of an array, but it is resizable and can contain elements of different types.<br>

NumPy data structures perform better in:<br>

 - Size - NumPy data structures take up less space
 - Performance - they are faster than lists
 - Functionality - SciPy and NumPy have optimized functions, such as linear algebra operations, built in.
 
The whole NumPy library is based on one main object: **ndarray** (which stands for N-dimensional array). This object is a multidimensional homogeneous array with a predetermined number of items: homogeneous because virtually all the items within it are of the same type and the same size. In fact, the data type is specified by another NumPy object called **dtype** (data-type); each ndarray is associated with only one type of dtype.<br>
The number of dimensions and items in an array is defined by its **shape**, a tuple of N positive integers that specifies the size of each dimension. The dimensions are called **axes**, and the number of axes is called the rank.

Moreover, another peculiarity of NumPy arrays is that their size is fixed: once you define the size at creation time, it remains unchanged. This behavior differs from Python lists, which can grow or shrink in size.<br>
The easiest way to define a new ndarray is to use the **array()** function, passing a Python list containing the elements as an argument.<br><br>
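
The fixed-size behavior can be seen in a short sketch (variable names are illustrative): `list.append` grows the list in place, while `np.append` leaves the original array untouched and returns a new, larger array.

```python
import numpy as np

numbers = [1, 2, 3]
numbers.append(4)            # the list itself grows in place

arr = np.array([1, 2, 3])
bigger = np.append(arr, 4)   # a new array is allocated and returned

print(len(numbers))            # 4
print(arr.size, bigger.size)   # 3 4
```

Because every such call copies the data, growing an array one element at a time in a loop is usually much slower than building a list first and converting it once.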



First, let's import NumPy as np. This lets us use the shortcut `np` to refer to NumPy.

In [1]:
import numpy as np


## Creating Arrays

Create a list and convert it to a NumPy array:

In [2]:
mylist = [1, 2, 3]
x = np.array(mylist)
x

array([1, 2, 3])

Or just pass in a list directly:

In [3]:
y = np.array([1, 2, 3])
y

array([1, 2, 3])

Pass in a list of lists to create a multidimensional array.

In [4]:
a = np.array([[1, 2, 3],[4, 5, 6]])
a

array([[1, 2, 3],
       [4, 5, 6]])

You can easily check that a newly created object is an `ndarray` by passing the new variable to the `type()` function.

In [5]:
type(a)

numpy.ndarray

To find the `dtype` associated with the newly created `ndarray`, use the `dtype` attribute. The default integer type is platform dependent, so you may see `int64` instead of `int32`.

In [6]:
a.dtype

dtype('int32')

Unlike Python lists, NumPy arrays must contain elements of a single type. If the types do not match, NumPy will upcast if possible (here, integers are upcast to floating point):

In [7]:
b = np.array([[1, 2, 3], [4., 5., 6]])
b

array([[1., 2., 3.],
       [4., 5., 6.]])

In [8]:
c = np.array([['a', 'b', 'c'], [1, 2, 3]])
c

array([['a', 'b', 'c'],
       ['1', '2', '3']], dtype='<U11')

If we want to explicitly set the data type of the resulting array, we can use the `dtype` keyword:

In [9]:
np.array([1, 3, 4, 7], dtype = 'float32')

array([1., 3., 4., 7.], dtype=float32)
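
The `dtype` keyword applies at creation time. For an array that already exists, one option is the `astype` method, which returns a converted copy. The sketch below uses made-up values; note that converting floats to integers drops the fractional part rather than rounding.

```python
import numpy as np

ints = np.array([1, 3, 4, 7])
as_float = ints.astype(np.float32)       # converted copy; `ints` is unchanged

floats = np.array([1.9, -2.7])
truncated = floats.astype(np.int64)      # fractional part dropped (toward zero)

print(as_float.dtype)   # float32
print(truncated)        # [ 1 -2]
```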

## NumPy Standard Data Types

NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations.<br>
Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.<br>

The standard NumPy data types are listed in the following table.<br>
Note that when constructing an array, they can be specified using a string:<br>

```python
np.zeros(10, dtype='int16')
```

Or using the associated NumPy object:

```python
np.zeros(10, dtype=np.int16)
```

| Data type	    | Description |
|---------------|-------------|
| ``bool_``     | Boolean (True or False) stored as a byte |
| ``int_``      | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| 
| ``intc``      | Identical to C ``int`` (normally ``int32`` or ``int64``)| 
| ``intp``      | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| 
| ``int8``      | Byte (-128 to 127)| 
| ``int16``     | Integer (-32768 to 32767)|
| ``int32``     | Integer (-2147483648 to 2147483647)|
| ``int64``     | Integer (-9223372036854775808 to 9223372036854775807)| 
| ``uint8``     | Unsigned integer (0 to 255)| 
| ``uint16``    | Unsigned integer (0 to 65535)| 
| ``uint32``    | Unsigned integer (0 to 4294967295)| 
| ``uint64``    | Unsigned integer (0 to 18446744073709551615)| 
| ``float_``    | Shorthand for ``float64`` (alias removed in NumPy 2.0; use ``float64``).| 
| ``float16``   | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| 
| ``float32``   | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| 
| ``float64``   | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| 
| ``complex_``  | Shorthand for ``complex128`` (alias removed in NumPy 2.0; use ``complex128``).| 
| ``complex64`` | Complex number, represented by two 32-bit floats| 
| ``complex128``| Complex number, represented by two 64-bit floats| 

For more information, refer to the [NumPy documentation](http://numpy.org/).
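
Rather than memorizing the limits in the table, you can query them at runtime. The following sketch uses `np.iinfo` for integer types and `np.finfo` for floating-point types.

```python
import numpy as np

int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)   # -128 127

uint16_max = np.iinfo(np.uint16).max
print(uint16_max)                     # 65535

float64_info = np.finfo(np.float64)
print(float64_info.nmant)             # 52 (mantissa bits)
print(float64_info.eps)               # gap between 1.0 and the next larger float
```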

There are two basic rules for every NumPy array:

 - Every element in the array must be of the same type and size.
 - If an array’s elements are also arrays, those inner arrays must have the same type and number of elements as each other. In other words, multidimensional arrays must be rectangular and not jagged.

For example, I can make a 1d array of integers and this is fine because every element in the array is an integer.


In [10]:
a = np.array([1, 3, 5])
a

array([1, 3, 5])

In [11]:
a[0] = 3.14
a.dtype

dtype('int32')

In [12]:
a

array([3, 3, 5])

Note that assigning the float `3.14` to an integer array does not change the array's dtype; the value is silently truncated to `3`.

In [13]:
#If I try to make an array from a list that contains a mix of integers and strings, watch what happens.
foo = np.array([1,'hello', 3])
foo

array(['1', 'hello', '3'], dtype='<U11')

So NumPy doesn’t raise an error; instead it casts the integers to strings to satisfy the requirement that every element has the same type.

The dtype ‘<U11’ stands for Unicode strings of at most 11 characters.

So if we try to reassign the first element to ‘a really really long string’:

In [15]:
foo[0] = 'a really really long string'
foo

array(['a really re', 'hello', '3'], dtype='<U11')

You can see that ‘a really really long string’ gets truncated to 11 characters.
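
If truncation is not what you want, one possible workaround is to request a wider string dtype up front, or to fall back to `dtype=object`, which stores the original Python objects. This is a sketch; the width of 30 is an arbitrary choice for this example.

```python
import numpy as np

# Reserve room for up to 30 characters per element (arbitrary width)
words = np.array(['1', 'hello', '3'], dtype='<U30')
words[0] = 'a really really long string'   # 27 characters, so it fits
print(words[0])                            # a really really long string

# dtype=object keeps the original Python objects, at the cost of speed
mixed = np.array([1, 'hello', 3], dtype=object)
print(type(mixed[0]))                      # <class 'int'>
```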

What if we try to build an array from a list of lists where the first inner list has four integers and the second inner list has four floats?

In [16]:
np.array([
    [1, 2, 3, 4],
    [5.5, 6.5, 5.5, 7.5],
])

array([[1. , 2. , 3. , 4. ],
       [5.5, 6.5, 5.5, 7.5]])

In this case, NumPy promotes the integers to floats, again to maintain homogeneous data types.

Alright, let’s see one last example. What if we try to build an array from a list of lists where the first inner list has four integers and the second inner list has two integers?

In [17]:
np.array([
    [1, 2, 3, 4],
    [5, 6]
])


array([list([1, 2, 3, 4]), list([5, 6])], dtype=object)

In this case NumPy gives us a warning that says “Creating an ndarray from ragged nested sequences is deprecated”, but we still get back an array with dtype ‘object’. (NumPy 1.24 and later raise a `ValueError` here unless `dtype=object` is passed explicitly.) An object array is just an array of pointers to Python objects, which is more or less the same as a standard Python list.
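
If a jagged structure is really what you need, a minimal sketch is to ask for `dtype=object` explicitly. The result is a one-dimensional array whose elements are ordinary Python lists.

```python
import numpy as np

ragged = np.array([[1, 2, 3, 4], [5, 6]], dtype=object)

print(ragged.shape)   # (2,)
print(ragged.dtype)   # object
print(ragged[1])      # [5, 6]
```

Such an array gives up most of NumPy's speed advantages, since operations on it fall back to the Python objects it points to.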

### Array attributes
Let's discuss some useful array attributes. We'll start by defining three random arrays, a one-dimensional, two-dimensional, and three-dimensional array. We'll use NumPy's random number generator, which we will seed with a set value in order to ensure that the same random arrays are generated each time this code is run:

In [18]:
np.random.seed(120)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array
print(x1,'\n')
print(x2,'\n')
print(x3)

[7 0 8 4 1 7] 

[[1 3 8 9]
 [4 9 9 6]
 [7 9 1 9]] 

[[[4 5 7 2 2]
  [2 0 0 7 4]
  [2 2 4 8 5]
  [6 4 1 9 2]]

 [[0 5 7 5 7]
  [8 4 2 1 6]
  [9 5 6 1 5]
  [7 8 7 4 8]]

 [[9 7 8 5 6]
  [0 5 8 6 8]
  [5 4 1 9 4]
  [5 7 9 1 2]]]


Each array has the following attributes: <br>
-  ndim (the number of dimensions), 
-  shape (the size of each dimension), 
-  size (the total number of elements in the array):

![image.png](attachment:image.png)

In [19]:
print(x1,'\n')

[7 0 8 4 1 7] 



In [20]:
print(x2,'\n')

[[1 3 8 9]
 [4 9 9 6]
 [7 9 1 9]] 



In [21]:
print("x1 ndim: ", x1.ndim)
print("x1 shape:", x1.shape)
print("x1 size: ", x1.size)
print("x2 ndim: ", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x1 ndim:  1
x1 shape: (6,)
x1 size:  6
x2 ndim:  2
x2 shape: (3, 4)
x2 size:  12
x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60


Another useful attribute is `dtype`, the data type of the array (which we discussed previously):

In [22]:
print("dtype:", x3.dtype)

dtype: int32


Other attributes include `itemsize`, which lists the size (in bytes) of each array element, and `nbytes`, which lists the total size (in bytes) of the array:

In [23]:
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

itemsize: 4 bytes
nbytes: 240 bytes


In general, we expect that `nbytes` is equal to `itemsize` times `size`.

In [24]:
x3.nbytes == x3.itemsize * x3.size

True

# Computation on NumPy Arrays: Universal Functions

Up until now, we have been discussing some of the basics of NumPy; now, we will dive into the reasons that NumPy is so important in the Python data science world.
Namely, it provides an easy and flexible interface to optimized computation with arrays of data.

Computation on NumPy arrays can be very fast, or it can be very slow.
The key to making it fast is to use *vectorized* operations, generally implemented through NumPy's *universal functions* (ufuncs).
This lesson motivates the need for NumPy's ufuncs, which can be used to make repeated calculations on array elements much more efficient.
It then introduces many of the most common and useful arithmetic ufuncs available in the NumPy package.

## The Slowness of Loops

Python's default implementation does some operations very slowly.
This is in part due to the dynamic, interpreted nature of the language: the fact that types are flexible, so that sequences of operations cannot be compiled down to efficient machine code as in languages like C and Fortran.<br>


# Timing Code

In the process of developing code and creating data processing pipelines, there are often trade-offs you can make between various implementations.
Early in developing your algorithm, it can be counterproductive to worry about such things. As Donald Knuth famously quipped, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

But once you have your code working, it can be useful to dig into its efficiency a bit.
Sometimes it's useful to check the execution time of a given command or set of commands; other times it's useful to dig into a multiline process and determine where the bottleneck lies in some complicated series of operations.
IPython provides access to functionality for this kind of timing and profiling of code.
Here we'll discuss the following IPython magic commands:

- ``%time``: Time the execution of a single statement
- ``%timeit``: Time repeated execution of a single statement for more accuracy


To get information about all magic functions, type and run:

In [35]:
%magic

## Timing Code Snippets: ``%timeit`` and ``%time``

We saw the ``%timeit`` line-magic and ``%%timeit`` cell-magic in the introduction to magic functions in Magic Commands; they can be used to time the repeated execution of snippets of code:

In [36]:
%timeit sum(range(100))

3.45 µs ± 296 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


Note that because this operation is so fast, ``%timeit`` automatically does a large number of repetitions.
For slower commands, ``%timeit`` will automatically adjust and perform fewer repetitions:

In [37]:
%%timeit
total = 0
for i in range(1000):
    for j in range(1000):
        total += i * (-1) ** j

915 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Sometimes repeating an operation is not the best option.
For example, if we have a list that we'd like to sort, we might be misled by a repeated operation.
Sorting a pre-sorted list is much faster than sorting an unsorted list, so the repetition will skew the result:

In [38]:
import random
L = [random.random() for i in range(100000)]
%timeit L.sort()

1.96 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


For this, the ``%time`` magic function may be a better choice. It also is a good choice for longer-running commands, when short, system-related delays are unlikely to affect the result.
Let's time the sorting of an unsorted and a presorted list:

In [39]:
import random
L = [random.random() for i in range(100000)]
print("sorting an unsorted list:")
%time L.sort()

sorting an unsorted list:
CPU times: total: 15.6 ms
Wall time: 31 ms


In [40]:
print("sorting an already sorted list:")
%time L.sort()

sorting an already sorted list:
CPU times: total: 0 ns
Wall time: 3 ms


Notice how much faster the presorted list is to sort, but notice also how much longer the timing takes with ``%time`` versus ``%timeit``, even for the presorted list!
This is a result of the fact that ``%timeit`` does some clever things under the hood to prevent system calls from interfering with the timing.
For example, it prevents cleanup of unused Python objects (known as *garbage collection*) which might otherwise affect the timing.
For this reason, ``%timeit`` results are usually noticeably faster than ``%time`` results.

For ``%time`` as with ``%timeit``, using the double-percent-sign cell magic syntax allows timing of multiline scripts:

In [41]:
%%time
total = 0
for i in range(1000):
    for j in range(1000):
        total += i ** j

CPU times: total: 59 s
Wall time: 59.8 s


For more information on ``%time`` and ``%timeit``, as well as their available options, use the IPython help functionality (i.e., type ``%time?``).
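
The ``%timeit`` magic is only available inside IPython or Jupyter. In a plain Python script, a rough equivalent is the standard library `timeit` module. The sketch below is an approximation, not a reproduction of what ``%timeit`` does internally; the repeat and loop counts are arbitrary choices.

```python
import timeit

LOOPS = 10_000   # executions per repeat (arbitrary)
REPEATS = 5      # number of repeats (arbitrary)

# Each entry is the total time in seconds for LOOPS executions
timings = timeit.repeat("sum(range(100))", repeat=REPEATS, number=LOOPS)

best_per_call = min(timings) / LOOPS
print(f"best of {REPEATS}: {best_per_call * 1e6:.2f} µs per call")
```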

But sometimes it can be helpful just to use the `time()` function from the `time` module.
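
A minimal sketch of manual timing follows. It uses `time.perf_counter()` rather than `time.time()`; both work, but `perf_counter` is a high-resolution clock intended for measuring short intervals. The workload is a placeholder.

```python
import time

start = time.perf_counter()
total = sum(range(1_000_000))          # placeholder workload
elapsed = time.perf_counter() - start

print(f"total = {total}")              # total = 499999500000
print(f"elapsed: {elapsed:.4f} sec")
```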

## About the better performance of an ndarray
Performance of a list

In [42]:
import time 
import sys 
 
mul = 1024*1024*128 
 
a = time.time() 
x = [0]*mul 
b = time.time() 
print("Creation_Time: "+str(b-a)+" sec") 
 
size = sys.getsizeof(x) 
print("Size_on_RAM: "+str(size/(1024.0*1024.0))+" MB") 
 
mat = range(mul) 
a = time.time() 
for i in mat: 
    x[i] = 5 
b = time.time() 
print("Write_Time_taken: "+str(b-a)+" sec") 

# a = time.time() 
# [6 for i in x]
# b = time.time() 
# print("Write_Time_taken_List_comprehension: "+str(b-a)+" sec") 

a = time.time() 
for i in x: 
    pass 
b=time.time() 
print("Read_Time_taken: "+str(b-a)+" sec")  

a = time.time() 
max(x)
b = time.time() 
print("Read_Time_taken_core_function: "+str(b-a)+" sec") 

Creation_Time: 1.0080575942993164 sec
Size_on_RAM: 512.0000267028809 MB
Write_Time_taken: 40.15829682350159 sec
Read_Time_taken: 9.709555387496948 sec
Read_Time_taken_core_function: 6.069347143173218 sec


Performance of ndarray

In [43]:
import time 
import sys 
import numpy as np 
 
mul = 1024*1024*128 
 
a = time.time() 
x = np.zeros(mul,dtype=np.int8) 
b = time.time() 
print("Creation_Time: "+str(b-a)+" sec") 
 
size=sys.getsizeof(x) 
print("Size_on_RAM: "+str(size/(1024.0*1024.0))+" MB") 
 
    
mat=range(mul) 
a = time.time() 
for i in mat: 
    x[i] = 5
b = time.time()
print("Write_Time_taken: "+str(b-a)+" sec") 
 
a = time.time() 
for i in x: 
    pass 
b = time.time() 
print("Read_Time_taken: "+str(b-a)+" sec") 

a = time.time() 
max(x)
b = time.time() 
print("Read_Time_taken_core_function: "+str(b-a)+" sec") 


Creation_Time: 0.5830333232879639 sec
Size_on_RAM: 128.00005340576172 MB
Write_Time_taken: 58.807363748550415 sec
Read_Time_taken: 31.96382784843445 sec
Read_Time_taken_core_function: 23.039317846298218 sec


It takes many seconds to compute these millions of operations and to store the result, and the element-by-element loop over the ndarray is even slower than the same loop over the list. When even cell phones have processing speeds measured in Giga-FLOPS (i.e., billions of numerical operations per second), this seems almost absurdly slow. It turns out that the bottleneck here is not the operations themselves, but the type-checking and function dispatches that Python must do at each cycle of the loop. Each time the output is computed, Python first examines the object's type and does a dynamic lookup of the correct function to use for that type. If we were working in compiled code instead, this type specification would be known before the code executes and the result could be computed much more efficiently.

In [44]:
a = time.time() 
x[:] = 5
b = time.time() 
print("Write_Time_taken_using_numpy: "+str(b-a)+" sec")

a = time.time() 
np.max(x)
b = time.time() 
print("Read_Time_taken_using numpy: "+str(b-a)+" sec") 

Write_Time_taken_using_numpy: 0.09300518035888672 sec
Read_Time_taken_using numpy: 0.06000351905822754 sec


## Introducing UFuncs

For many types of operations, NumPy provides a convenient interface into just this kind of statically typed, compiled routine. This is known as a *vectorized* operation.
This can be accomplished by simply performing an operation on the array, which will then be applied to each element.
This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.
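
To make the contrast concrete, here is a small sketch that computes reciprocals twice: once with an explicit Python loop and once with a single vectorized expression. The function name `reciprocals_loop` is made up for this example.

```python
import numpy as np

def reciprocals_loop(values):
    """Compute 1/x one element at a time in a Python loop."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.arange(1, 6)                 # [1 2 3 4 5]
looped = reciprocals_loop(values)        # the loop runs in the interpreter
vectorized = 1.0 / values                # the loop runs inside NumPy's compiled code

print(np.allclose(looped, vectorized))   # True
```

Both versions give the same numbers; on large arrays the vectorized form is typically far faster because the per-element type checks happen only once.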


Vectorized operations in NumPy are implemented via *ufuncs*, whose main purpose is to quickly execute repeated operations on values in NumPy arrays.
Ufuncs are extremely flexible – before we saw an operation between a scalar and an array, but we can also operate between two arrays:

In [45]:
np.arange(5) / np.arange(1, 6)

array([0.        , 0.5       , 0.66666667, 0.75      , 0.8       ])
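
Ufuncs are not limited to one-dimensional arrays. As a brief sketch, the same element-wise behavior applies to a two-dimensional array, and an arithmetic operator gives the same result as calling the corresponding ufunc directly.

```python
import numpy as np

grid = np.arange(9).reshape((3, 3))   # values 0..8 in a 3x3 grid

powers = 2 ** grid                    # 2 raised to each element
sums = np.add(grid, grid)             # same result as grid + grid

print(powers[2, 2])                         # 256
print(np.array_equal(sums, grid + grid))    # True
```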