# Where to start learning?

<a href = "https://numpy.org/learn/"> Numpy - Getting Started (Official) </a>

## What is Numpy? 

<b>Numerical Python</b> (NumPy) is the fundamental package for scientific computing in Python. It is a Python library that provides a <b>multidimensional array object</b>, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

At the core of the NumPy package, is the <u>ndarray object</u>. This encapsulates n-dimensional arrays of <u>homogeneous data types</u>, with many <u>operations being performed in compiled code for performance.</u> There are several important differences between NumPy arrays and the standard Python sequences:
<pre>
i) NumPy arrays have a fixed size at creation, unlike Python lists (which can grow dynamically). Changing the size of an ndarray will create a new array and delete the original.

ii) The elements in a NumPy array are all required to be of the same data type, and thus will be the same size in memory. The exception: one can have arrays of (Python, including NumPy) objects, thereby allowing for arrays of different sized elements.

iii) NumPy arrays facilitate advanced mathematical and other types of operations on large numbers of data. Typically, such operations are executed more efficiently and with less code than is possible using Python’s built-in sequences.
</pre>
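
The first two points can be checked directly. The following is a minimal sketch using only `np.append` and the `dtype` attribute; the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

arr = np.array([1, 2, 3])

# np.append does not grow `arr` in place; it returns a new, larger array.
bigger = np.append(arr, 4)
print(arr.shape, bigger.shape)   # (3,) (4,)
print(bigger is arr)             # False

# Mixing ints and floats yields one common dtype for every element.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)               # float64

# dtype=object is the exception: elements may have different types and sizes.
objs = np.array([1, "two", 3.0], dtype=object)
print(objs.dtype)                # object
```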

A growing plethora of scientific and mathematical Python-based packages are using NumPy arrays; though these typically support Python-sequence input, they convert such input to NumPy arrays prior to processing, and they often output NumPy arrays. Arrays are also frequently used in data science, where speed and optimal utilization of resources are very important.

In [1]:
import numpy as np

In [2]:
np.__version__

'1.20.1'

In [3]:
np.version.version

'1.20.1'

## Why is Numpy faster?

<ul>
    <li>Numpy uses the technique of <b><u>Vectorization</u></b> which describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code.</li>
<li>Numpy arrays are stored in contiguous memory locations whereas lists are not. Contiguous refers to sections of memory that are next to one another.
</li>
<li>Many NumPy routines are optimized for modern CPU architectures: the compiled loops can use SIMD instructions to process several elements at once, and some operations (for example, linear algebra backed by BLAS) may run on multiple threads.
    </li>
</ul>
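
The contiguous-storage point can be inspected through an array's `flags`, `itemsize`, `nbytes`, and `strides` attributes. This is a small sketch with an illustrative array, assuming an explicit `int64` dtype so the byte counts are the same on every platform.

```python
import numpy as np

arr = np.arange(6, dtype=np.int64)

# A freshly created array occupies one contiguous block of memory.
print(arr.flags["C_CONTIGUOUS"])   # True

# Every element has the same size, so the total size is itemsize * count.
print(arr.itemsize, arr.nbytes)    # 8 48

# Taking every other element gives a non-contiguous view of the same buffer.
view = arr[::2]
print(view.flags["C_CONTIGUOUS"])  # False
print(view.strides)                # (16,)
```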

<b>Note</b>: ***NumPy arrays show a noticeable reduction in run time only when the number of observations is very large.***

### <u>Problem statement</u> - Convert a random list of temperatures in Celsius to Fahrenheit. Compare the performance of the various methods and show how NumPy arrays work better with larger data.

In [4]:
ten_temp_celcius = [np.random.randint(5, 50) for _ in range(10)]
print(ten_temp_celcius)

[16, 34, 32, 33, 16, 15, 28, 27, 48, 33]


<b>Let's convert this list of temperatures into Fahrenheit using three different methods:</b>
<pre>
i) using loops
ii) using list comprehensions
iii) using vectorized code</pre>

### Solution:

#### Method 1 - Loops

In [5]:
temp_in_far_l = []

for temp in ten_temp_celcius:
    temp_in_far_l.append(9/5*temp + 32)

print(temp_in_far_l)

[60.8, 93.2, 89.6, 91.4, 60.8, 59.0, 82.4, 80.6, 118.4, 91.4]


#### Method 2 - List Comprehensions

In [6]:
temp_in_far_lc = [(c*9/5 + 32) for c in ten_temp_celcius]
print(temp_in_far_lc)

[60.8, 93.2, 89.6, 91.4, 60.8, 59.0, 82.4, 80.6, 118.4, 91.4]


#### Method 3 - Vectorized Solution

In [7]:
temp_in_far_v = (np.array(ten_temp_celcius)*9/5 + 32)
print(temp_in_far_v)

[ 60.8  93.2  89.6  91.4  60.8  59.   82.4  80.6 118.4  91.4]


### Comparing Performances - 1 (with 10 data points)

In [8]:
import timeit

In [9]:
# Statements to time for each of the three methods

method_loops = """
temp_in_far_l = []

for temp in ten_temp_celcius:
    temp_in_far_l.append(9/5*temp + 32)
"""

method_listc = """
temp_in_far_lc = [(c*9/5 + 32) for c in ten_temp_celcius]
"""

method_vc = """
temp_in_far_v = (np.array(ten_temp_celcius)*9/5 + 32)
"""

In [10]:
# Method - 1

timeit.timeit(method_loops, setup='import random; ten_temp_celcius = [random.randint(5, 50) for _ in range(10)]')

2.4717087999999996

In [11]:
# Method - 2

timeit.timeit(method_listc, setup='import random; ten_temp_celcius = [random.randint(5, 50) for _ in range(10)]')

2.653466400000001

In [12]:
# Method  - 3

timeit.timeit(method_vc, setup = "import numpy as np; ten_temp_celcius = [np.random.randint(5, 50) for _ in range(10)]")

6.5340723999999994

***With only 10 data points, the vectorized code took the longest to run: converting the list to an array and the fixed overhead of each NumPy call outweigh any gain on so few elements. Let's increase the size of the data and repeat the comparison.***

### Comparing Performances - 2 (with 100000 data points)

In [13]:
# Statements to time for each of the three methods (100000 data points)

method_loops_100000 = """
temp_in_far_l = []

for temp in ht_temp_celcius:
    temp_in_far_l.append(9/5*temp + 32)
"""

method_listc_100000 = """
temp_in_far_lc = [(c*9/5 + 32) for c in ht_temp_celcius]
"""

method_vc_100000 = """
temp_in_far_v = (np.array(ht_temp_celcius)*9/5 + 32)
"""

In [14]:
timeit.timeit(method_loops_100000, number=1000, setup='import random; ht_temp_celcius = [random.randint(0, 100) for _ in range(100000)]')

19.732050199999996

In [15]:
timeit.timeit(method_listc_100000, number=1000, setup='import random; ht_temp_celcius = [random.randint(0, 100) for _ in range(100000)]')

22.370302699999996

In [16]:
timeit.timeit(method_vc_100000, number=1000, setup='import numpy as np; ht_temp_celcius = np.array([np.random.randint(0, 100) for _ in range(100000)])')

1.6267635000000027

***As we can see from the above comparison, this time the vectorized solution is by far the fastest. This is because of the number of data points (100000, compared to 10 in the last comparison).***
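
One caveat about the two vectorized timings: `method_vc` converts a list inside the timed statement, while `method_vc_100000` receives an array that was already built in `setup`. The sketch below times the conversion and the arithmetic separately. The names `celsius_list`, `celsius_arr`, and the three helper functions are illustrative and not from the original notebook, and `number=20` is an assumption chosen only to keep the run short. Timings vary by machine, so no expected output is shown.

```python
import random
import timeit

import numpy as np

celsius_list = [random.randint(0, 100) for _ in range(100_000)]
celsius_arr = np.array(celsius_list)


def list_comprehension():
    return [c * 9 / 5 + 32 for c in celsius_list]


def convert_then_compute():
    # Includes the cost of building the array from the list.
    return np.array(celsius_list) * 9 / 5 + 32


def compute_only():
    # The array already exists; only the arithmetic is timed.
    return celsius_arr * 9 / 5 + 32


for func in (list_comprehension, convert_then_compute, compute_only):
    elapsed = timeit.timeit(func, number=20)
    print(f"{func.__name__:22s} {elapsed:.4f} s")
```

If the data already lives in an array, `compute_only` is the relevant figure; if it starts out as a list, `convert_then_compute` is the fairer comparison.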

## Vectorization vs Broadcasting

### Vectorization:
<p>
LOOPS ARE SLOW. A Python list can hold elements of any data type. What if we could restrict a list to a single data type that Python knows in advance? NumPy does essentially that: an array holds a single data type and stores its data internally in a contiguous block of memory.</p>

<p>
    <u>VECTORIZATION</u> is a powerful ability within NumPy to express operations as occurring on <u> entire arrays </u> rather than individual elements.
</p>

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3> Here’s a concise definition from Wes McKinney: </h3>

<b> 

This practice of replacing explicit loops with array expressions is commonly referred to as vectorization. In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact [seen] in any kind of numerical computations. </b>

</div>


In [17]:
np.array([5,1,4,9,5,6]) * -5                        # vectorization

array([-25,  -5, -20, -45, -25, -30])
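
To make "replacing explicit loops with array expressions" concrete, here is a minimal sketch that computes the same result both ways. The names `values`, `looped`, and `vectorized` are illustrative.

```python
import numpy as np

values = np.array([5, 1, 4, 9, 5, 6])

# Explicit Python loop: one multiplication per iteration.
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * -5

# Array expression: the same loop runs inside NumPy's compiled code.
vectorized = values * -5

print(np.array_equal(looped, vectorized))  # True
```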

### Broadcasting:

<p>
    Broadcasting is another feature of NumPy's array abstraction. Operations between equally sized NumPy arrays work element by element, for example:
</p>

In [18]:
a = np.array([7,25,35,152])
b = np.array([8,6,91,52])

a/b

array([0.875     , 4.16666667, 0.38461538, 2.92307692])

***But what if the arrays are unequally sized? This is where broadcasting comes in!***

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3> Here’s a concise definition from the documentation: </h3>

<b> 

The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. </b>

</div>

<p>
The way in which broadcasting is implemented can become tedious to follow when working with more than two arrays. However, if there are just two arrays, then whether they can be broadcast together is described by two short rules:

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions and works its way forward. Two dimensions are compatible when:
<pre>
<b>1. they are equal, or
2. one of them is 1</b>
</pre>
</p>
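
The two rules can be tried out on shapes alone with `np.broadcast_shapes`, which is available from NumPy 1.20 onward (the version used in this notebook). This is a small sketch; the shapes and the names `col` and `row` are illustrative.

```python
import numpy as np

# Compatible: compared from the trailing dimension, each pair is equal or contains a 1.
print(np.broadcast_shapes((3, 1), (1, 4)))   # (3, 4)
print(np.broadcast_shapes((5, 4), (4,)))     # (5, 4)

# Incompatible: the trailing dimensions 3 and 4 differ and neither is 1.
try:
    np.broadcast_shapes((3,), (4,))
except ValueError as err:
    print("cannot broadcast:", err)

# The same rule applied to real arrays.
col = np.arange(3).reshape(3, 1)
row = np.arange(4).reshape(1, 4)
print((col + row).shape)                     # (3, 4)
```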

In [19]:
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 2.0])
x*y                                         # equal shapes: plain element-wise multiplication

array([2., 4., 6.])

In [20]:
%timeit x*y                                 # In an IPython console, %timeit measures the time elapsed

763 ns ± 16.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


Vectorization and broadcasting may seem like the same thing, and they are to a certain degree. When we multiply two arrays, NumPy delegates the loop to pre-compiled, optimized C code under the hood. This process is called vectorization of the multiplication operator. Vectorization avoids explicit Python loops, which makes the code run faster. <u>Broadcasting defines how arithmetic operations are performed on arrays of unequal size.</u>

In [21]:
br = np.array([1,64,1,65,65])

Let's say we want to add 2 to all the elements of the array.

If we do that, then the scalar 2 is stretched to [2,2,2,2,2] to match the size of the array we are adding to. In other words, the smaller array is broadcast over the larger one.

In [22]:
br + 2

array([ 3, 66,  3, 67, 67])

Now, NumPy doesn't actually copy the elements to create a new "stretched" array; rather, the computation is arranged so that the operation is performed across all the elements of the array.
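
`np.broadcast_to` makes this visible. As a rough sketch (the name `stretched` is illustrative), the "stretched" operand is a read-only view whose stride is 0, so all five positions read the same single value in memory.

```python
import numpy as np

br = np.array([1, 64, 1, 65, 65])

# broadcast_to returns a read-only view with the same shape as `br`.
stretched = np.broadcast_to(2, br.shape)
print(stretched)           # [2 2 2 2 2]

# A stride of 0 means every position reads the same single element in memory.
print(stretched.strides)   # (0,)

print(br + stretched)      # [ 3 66  3 67 67]
```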

<a href = "https://blog.paperspace.com/numpy-optimization-vectorization-and-broadcasting/"> Vectorization and Broadcasting </a>

<a href = "https://www.youtube.com/watch?v=0u9OzBSRZec">More on Broadcasting </a>

<a href = "https://realpython.com/numpy-array-programming/">Numpy Array Programming</a>