# Numerical Computing with Python and Numpy

### Part 6 of "Data Analysis with Python: Zero to Pandas"

This tutorial is the sixth in a series introducing programming and data analysis with the Python language. These tutorials take a practical coding-based approach, and the best way to learn the material is to execute the code and experiment with the examples. Check out the full series here:

1. [First Steps with Python and Jupyter](https://jovian.ml/aakashns/first-steps-with-python)
2. [A Quick Tour of Variables and Data Types](https://jovian.ml/aakashns/python-variables-and-data-types)
3. [Branching using Conditional Statements and Loops](https://jovian.ml/aakashns/python-branching-and-loops)
4. [Writing Reusable Code Using Functions](https://jovian.ml/aakashns/python-functions-and-scope)
5. [Reading from and Writing to Files](https://jovian.ml/aakashns/python-os-and-filesystem)
6. [Numerical Computing with Python and Numpy](https://jovian.ml/aakashns/numerical-computing-with-numpy)



## How to run the code

This tutorial is hosted on [Jovian.ml](https://www.jovian.ml), a platform for sharing data science projects online. You can "run" this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your own computer*.

>  This tutorial is a [Jupyter notebook](https://jupyter.org) - a document made of "cells", which can contain explanations in text or code written in Python. Code cells can be executed and their outputs e.g. numbers, messages, graphs, tables, files etc. can be viewed within the notebook, which makes it a really powerful platform for experimentation and analysis. Don't be afraid to experiment with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top of the notebook.

### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing this notebook is to click the "Run" button at the top of this page, and select "Run on Binder". This will run the notebook on [mybinder.org](https://mybinder.org), a free online service for running Jupyter notebooks. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


### Option 2: Running on your computer locally

You'll need to install Python and download this notebook on your computer to run it locally. We recommend using the [Conda](https://docs.conda.io/en/latest/) distribution of Python. Here's what you need to do to get started:

1. Install Conda by [following these instructions](https://conda.io/projects/conda/en/latest/user-guide/install/index.html). Make sure to add Conda binaries to your system `PATH` to be able to run the `conda` command line tool from your Mac/Linux terminal or Windows command prompt. 


2. Create and activate a [Conda virtual environment](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) called `zerotopandas` which you can use for this tutorial series:
```
conda create -n zerotopandas -y python=3.8 
conda activate zerotopandas
```
You'll need to create the environment only once, but you'll have to activate it every time you want to run the notebook. When the environment is activated, you should be able to see the prefix `(zerotopandas)` within your terminal or command prompt.


3. Install the required Python libraries within the environment by running the following command on your terminal or command prompt:
```
pip install jovian jupyter numpy pandas matplotlib seaborn --upgrade
```

4. Download the notebook for this tutorial using the `jovian clone` command:
```
jovian clone aakashns/numerical-computing-with-numpy
```
The notebook is downloaded to the directory `numerical-computing-with-numpy`.


5. Enter the project directory and start the Jupyter notebook:
```
cd numerical-computing-with-numpy
jupyter notebook
```

6. You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 on your browser. Click on the notebook `numerical-computing-with-numpy.ipynb` to open it and run the code. If you want to type out the code yourself, you can also create a new notebook using the "New" button.


## Creating and using Numpy Arrays

The *data* in *Data Analysis* typically refers to numerical data e.g. stock prices, sales figures, sensor measurements, database tables etc. The [Numpy](https://numpy.org) library provides specialized data structures, functions and other tools for numerical computing in Python. Let's work through an example to see why & how to use Numpy for working with numerical data.


> Let's say we want to use climate data like the temperature, rainfall and humidity in a region to determine if the region is well suited for growing apples. A really simple approach for doing this would be to formulate the relationship between the annual yield of apples (tons per hectare) and the climatic conditions like the average temperature (in degrees Fahrenheit), rainfall (in centimeters) & average relative humidity (in percentage) as a linear equation.
>
> `yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`


We're expressing the yield of apples as a weighted sum of the temperature, rainfall and humidity. Based on some statistical analysis of historical data, we might be able to come up with reasonable values for the weights `w1`, `w2` and `w3`.

In [1]:
w1, w2, w3 = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict what the yield of apples in the region might look like. Here's some sample data:

<img src="https://i.imgur.com/TXPBiqv.png" style="max-width:360px;">

We can use variables to store the climate data for a region.

In [2]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

These numbers can now be substituted into the formula to get the predicted yield of apples.

In [3]:
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples

56.8

In [4]:
print("The expected yield of apples in Kanto region is {} tons per hectare.".format(kanto_yield_apples))

The expected yield of apples in Kanto region is 56.8 tons per hectare.


To make it slightly easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector i.e. a list of numbers.

In [5]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

The three numbers in each vector represent the temperature, rainfall and humidity data respectively.

The set of weights to be used in the formula can also be represented as a vector.

In [6]:
weights = [w1, w2, w3]

We can now write a function `crop_yield` to calculate the yield of apples (or any other crop) given the climate data and the respective weights.

In [7]:
def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

In [8]:
crop_yield(kanto, weights)

56.8

In [9]:
crop_yield(johto, weights)

76.9

In [10]:
crop_yield(unova, weights)

74.9

The calculation performed by `crop_yield` (element-wise multiplication of two vectors, followed by a sum of the results) is also called the *dot product* of the two vectors. Learn more about the dot product here: https://www.khanacademy.org/math/linear-algebra/vectors-and-spaces/dot-cross-products/v/vector-dot-product-and-vector-length

The Numpy library provides a built-in function to perform the dot product of two vectors. However, the lists must first be converted to numpy arrays before we can perform the operation. To begin, let's import the `numpy` module. It is common practice to import numpy with the alias `np`.

In [11]:
import numpy as np

Numpy arrays can be created using the `np.array` function.

In [12]:
kanto = np.array([73, 67, 43])

In [13]:
kanto

array([73, 67, 43])

In [14]:
weights = np.array([w1, w2, w3])

In [15]:
weights

array([0.3, 0.2, 0.5])

In [16]:
type(kanto)

numpy.ndarray

We can now compute the dot product of the two vectors using the `np.dot` function.

In [17]:
np.dot(kanto, weights)

56.8

We can achieve the same result with lower level operations supported by Numpy arrays: performing an element-wise multiplication and calculating the sum of the resulting numbers.

In [18]:
(kanto * weights).sum()

56.8

The `*` operator performs an element-wise multiplication of two arrays (assuming they have the same size), and the `sum` method calculates the sum of numbers in an array.

In [19]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

In [20]:
arr1 * arr2

array([ 4, 10, 18])

In [21]:
arr2.sum()

15

There are a couple of important benefits of using Numpy arrays instead of Python lists for operating on vectors:

- You can use small, concise and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `crop_yield`.
- Numpy operations and functions are largely implemented in optimized C code, which makes them much faster than Python statements & loops, which are interpreted at runtime.

Here's a quick comparison of dot products of vectors with a million elements each using Python loops vs. Numpy arrays.

In [22]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

In [23]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

CPU times: user 141 ms, sys: 1.8 ms, total: 142 ms
Wall time: 142 ms


833332333333500000

In [24]:
%%time
np.dot(arr1_np, arr2_np)

CPU times: user 1.47 ms, sys: 513 µs, total: 1.99 ms
Wall time: 1.11 ms


833332333333500000

As you can see, using `np.dot` is roughly 100 times faster than using a `for` loop (about 142 ms vs. 1.1 ms in the run above; exact timings vary by machine). This makes Numpy especially useful while working with really large datasets.
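The `%%time` magic only works inside Jupyter/IPython. If you want to repeat the comparison in a plain Python script, the following is a minimal sketch using the standard `timeit` module. The helper name `loop_dot` and the smaller vector size are assumptions made for illustration, and the timings you see will differ from machine to machine.

```python
import timeit

import numpy as np

# Smaller vectors than in the cells above, so the sketch runs quickly.
n = 100_000
list1, list2 = list(range(n)), list(range(n, 2 * n))

# Request 64-bit integers explicitly so the result cannot overflow.
arr1_np = np.array(list1, dtype=np.int64)
arr2_np = np.array(list2, dtype=np.int64)


def loop_dot(a, b):
    """Dot product using a plain Python loop (hypothetical helper)."""
    result = 0
    for x, y in zip(a, b):
        result += x * y
    return result


# Time 10 runs of each approach.
loop_time = timeit.timeit(lambda: loop_dot(list1, list2), number=10)
numpy_time = timeit.timeit(lambda: np.dot(arr1_np, arr2_np), number=10)

print("Python loop:", loop_time, "seconds")
print("np.dot:     ", numpy_time, "seconds")
```

Both approaches compute the same number; only the time taken differs.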

We can now go one step further, and represent the climate data for all the regions together using a single 2-dimensional Numpy array.

In [25]:
climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

In [26]:
climate_data

array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

If you've taken linear algebra in high school, you might recognize the above 2-d array as a *matrix* with 5 rows (one for each region) and 3 columns (containing values for temperature, rainfall and humidity). You can learn more about matrices here: https://www.khanacademy.org/math/algebra-home/alg-matrices

We can now compute the predicted yields of apples in all the regions, using a single matrix multiplication.

<img src="https://i.imgur.com/LJ2WKSI.png" width="240">

We can use the `np.matmul` function from numpy, or simply use the `@` operator to perform matrix multiplication.

In [27]:
np.matmul(climate_data, weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [28]:
climate_data @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])
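To see what the matrix multiplication is doing, it may help to compare it with computing one dot product per region. The sketch below redefines `climate_data` and `weights` so it can run on its own; the names `row_by_row` and `all_at_once` are introduced here only for illustration.

```python
import numpy as np

climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])
weights = np.array([0.3, 0.2, 0.5])

# One dot product per row (region), collected into an array
row_by_row = np.array([np.dot(row, weights) for row in climate_data])

# The same computation as a single matrix multiplication
all_at_once = climate_data @ weights

print(row_by_row)
print(np.allclose(row_by_row, all_at_once))
```

Multiplying a matrix of shape `(5, 3)` by a vector of shape `(3,)` gives a vector of shape `(5,)`, with one predicted yield per region.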

Numpy provides hundreds of operators and functions to manipulate arrays. Check out the official quickstart tutorial to learn more: https://numpy.org/doc/stable/user/quickstart.html

### Properties of numpy arrays

We've already seen that Python lists (or lists of lists, lists of lists of lists etc.) can be converted into numpy arrays using the `np.array` function.

In [29]:
# 1D array
arr1 = np.array([1, 2, 3, 4, 5])

In [30]:
arr1

array([1, 2, 3, 4, 5])

In [31]:
# 2D array
arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

In [32]:
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 1, 2, 3]])

In [33]:
# 3D array
arr3 = np.array([
    [[11, 12, 13], 
     [13, 14, 15]], 
    [[15, 16, 17], 
     [17, 18, 19.5]]])

In [34]:
arr3

array([[[11. , 12. , 13. ],
        [13. , 14. , 15. ]],

       [[15. , 16. , 17. ],
        [17. , 18. , 19.5]]])

Numpy arrays can have any number of dimensions, and different lengths along each dimension. We can inspect the length along each dimension using the `.shape` property of an array.

In [35]:
arr1

array([1, 2, 3, 4, 5])

In [36]:
arr1.shape

(5,)

In [37]:
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 1, 2, 3]])

In [38]:
arr2.shape

(3, 4)

In [39]:
arr3

array([[[11. , 12. , 13. ],
        [13. , 14. , 15. ]],

       [[15. , 16. , 17. ],
        [17. , 18. , 19.5]]])

In [40]:
arr3.shape

(2, 2, 3)

All the elements in a numpy array have the same data type. You can check the data type of an array using the `.dtype` property.

In [41]:
arr1.dtype

dtype('int64')

In [42]:
arr2.dtype

dtype('int64')

If an array contains even a single floating point number, all the other elements are also converted to floats.

In [43]:
arr3 = np.array([
    [[11, 12, 13], 
     [13, 14, 15]], 
    [[15, 16, 17], 
     [17, 18, 19.5]]])

In [44]:
arr3

array([[[11. , 12. , 13. ],
        [13. , 14. , 15. ]],

       [[15. , 16. , 17. ],
        [17. , 18. , 19.5]]])

In [45]:
arr3.dtype

dtype('float64')
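If you'd rather choose the data type yourself than rely on this automatic conversion, one option is the `dtype` argument of `np.array`, and the `astype` method for converting an existing array. A small sketch (the variable names are made up for this example):

```python
import numpy as np

mixed = np.array([1, 2, 3.5])                  # one float makes the whole array float
as_float = np.array([1, 2, 3], dtype=float)    # request floats explicitly
truncated = mixed.astype(int)                  # converting to int drops the fractional part

print(mixed.dtype)      # float64
print(as_float.dtype)   # float64
print(truncated)        # [1 2 3]
```

Note that `astype(int)` truncates rather than rounds, so `3.5` becomes `3`.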

Numpy also provides some handy functions to create arrays of a desired shape with fixed or random values. Check out the [official documentation](https://numpy.org/doc/stable/reference/routines.array-creation.html) or use the `help` function to learn more about the following functions.

In [46]:
# All zeros
np.zeros((3, 2))

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [47]:
# All ones
np.ones([2, 2, 3])

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [48]:
# Identity matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [49]:
# Random vector
np.random.rand(5)

array([0.0832852 , 0.23567451, 0.82424802, 0.53772511, 0.06858706])

In [50]:
# Random matrix
np.random.randn(2, 3) # rand vs. randn - what's the difference?

array([[-1.23742141, -0.24489737,  0.70464334],
       [-0.37155842,  0.42725119,  1.000889  ]])
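As a hint for the question in the comment above: `np.random.rand` draws from a uniform distribution over `[0, 1)`, while `np.random.randn` draws from the standard normal distribution (mean 0, standard deviation 1). The sketch below checks this empirically; the seed and sample size are arbitrary choices made for this illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the sketch repeatable

uniform = np.random.rand(100_000)    # uniform over [0, 1)
normal = np.random.randn(100_000)    # standard normal: mean 0, std 1

print(uniform.min(), uniform.max())  # both between 0 and 1
print(normal.mean(), normal.std())   # close to 0 and 1
print((normal < 0).any())            # True: randn also produces negative values
```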

In [51]:
# Fixed value
np.full([2, 3], 42)

array([[42, 42, 42],
       [42, 42, 42]])

In [52]:
# Range with start, end and step
np.arange(10, 90, 3)

array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
       61, 64, 67, 70, 73, 76, 79, 82, 85, 88])

In [53]:
# Equally spaced numbers in a range
np.linspace(3, 27, 9)

array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])
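`np.arange` and `np.linspace` look similar but differ in two ways that are easy to trip over: `arange` takes a step size and leaves out the end value, while `linspace` takes the number of points and includes the end value by default. A quick side-by-side sketch:

```python
import numpy as np

# arange takes a step and leaves out the end value
a = np.arange(3, 27, 3)

# linspace takes a count and includes the end value by default
b = np.linspace(3, 27, 9)

print(a)   # [ 3  6  9 12 15 18 21 24]
print(b)   # [ 3.  6.  9. 12. 15. 18. 21. 24. 27.]
```

Also note that `arange` with integer arguments returns integers, whereas `linspace` returns floats.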

### Array indexing and slicing

Numpy extends Python's list indexing notation using `[]` to multiple dimensions in a fairly intuitive fashion. You can provide a comma separated list of indices or ranges to select a specific element or a subarray/slice from a numpy array.

In [54]:
arr3 = np.array([
    [[11, 12, 13, 14], 
     [13, 14, 15, 19]], 
    
    [[15, 16, 17, 21], 
     [63, 92, 36, 18]], 
    
    [[98, 32, 81, 23],      
     [17, 18, 19.5, 43]]])

In [55]:
arr3.shape

(3, 2, 4)

In [56]:
# Single element
arr3[1, 1, 2]

36.0

In [57]:
# Subarray using ranges
arr3[1:, 0:1, :2]

array([[[15., 16.]],

       [[98., 32.]]])

In [58]:
# Mixing indices and ranges
arr3[1:, 1, 3]

array([18., 43.])

In [59]:
# Mixing indices and ranges
arr3[1:, 1, :3]

array([[63. , 92. , 36. ],
       [17. , 18. , 19.5]])

In [60]:
# Using fewer indices
arr3[1]

array([[15., 16., 17., 21.],
       [63., 92., 36., 18.]])

In [61]:
# Using fewer indices
arr3[:2, 1]

array([[13., 14., 15., 19.],
       [63., 92., 36., 18.]])

In [62]:
# Using too many indices
arr3[1,3,2,1]

IndexError: too many indices for array: array is 3-dimensional, but 4 were indexed

The notation and results can be confusing at first, so take your time to experiment and become comfortable with it. Use the cells below to try out some examples of array indexing and slicing, with different combinations of indices and ranges.
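One rule of thumb that may help while experimenting: each plain index removes a dimension from the result, each range keeps it, and positions you leave out behave like `:`. The sketch below checks this on a stand-in array `arr` (an assumption for this example, filled with the numbers 0 to 23) that has the same shape `(3, 2, 4)` as `arr3` above.

```python
import numpy as np

# A stand-in array with the same shape (3, 2, 4) as arr3 above
arr = np.arange(24).reshape(3, 2, 4)

print(arr[1, 1, 2].shape)       # ()         three indices: a single element
print(arr[1:, 0:1, :2].shape)   # (2, 1, 2)  three ranges: still 3 dimensions
print(arr[1:, 1, 3].shape)      # (2,)       two indices remove two dimensions
print(arr[1:, 1, :3].shape)     # (2, 3)     one index removes one dimension
print(arr[1].shape)             # (2, 4)     missing positions act like ':'
```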

Numpy arrays support common arithmetic operators. You can perform an arithmetic operation with a single number (also called scalar), or another array of the same shape.

In [63]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [64]:
arr3 = np.array([[11, 12, 13, 14], 
                 [15, 16, 17, 18], 
                 [19, 11, 12, 13]])

In [65]:
# Adding a scalar
arr2 + 3

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12,  4,  5,  6]])

In [67]:
# Element-wise subtraction
arr3 - arr2

array([[10, 10, 10, 10],
       [10, 10, 10, 10],
       [10, 10, 10, 10]])

In [68]:
# Division by scalar
arr2 / 2

array([[0.5, 1. , 1.5, 2. ],
       [2.5, 3. , 3.5, 4. ],
       [4.5, 0.5, 1. , 1.5]])

In [69]:
# Element-wise multiplication
arr2 * arr3

array([[ 11,  24,  39,  56],
       [ 75,  96, 119, 144],
       [171,  11,  24,  39]])

In [72]:
# Modulus with scalar
arr2 % 4

array([[1, 2, 3, 0],
       [1, 2, 3, 0],
       [1, 1, 2, 3]])

Numpy arrays also support *broadcasting*, which allows arithmetic operations between two arrays having a different number of dimensions, but compatible shapes. Let's look at an example to see how it works.

In [73]:
arr2 = np.array([[1, 2, 3, 4], 
                 [5, 6, 7, 8], 
                 [9, 1, 2, 3]])

In [77]:
arr2.shape

(3, 4)

In [79]:
arr4 = np.array([4, 5, 6, 7])

In [80]:
arr4.shape

(4,)

In [76]:
arr2 + arr4

array([[ 5,  7,  9, 11],
       [ 9, 11, 13, 15],
       [13,  6,  8, 10]])

When the expression `arr2 + arr4` is evaluated, `arr4` (which has the shape `(4,)`) is replicated 3 times to match the shape `(3, 4)` of `arr2`. This is pretty useful, because numpy performs the replication without actually creating 3 copies of the smaller array.
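To make the replication concrete, here is a sketch that builds the replicated array explicitly with `np.tile` and confirms that it gives the same result as broadcasting. The name `arr4_replicated` is introduced only for this illustration; in practice you would let numpy broadcast and skip the explicit copy.

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])
arr4 = np.array([4, 5, 6, 7])

# Spell out the replication that broadcasting performs implicitly
arr4_replicated = np.tile(arr4, (3, 1))   # explicit copy with shape (3, 4)

print(arr4_replicated)
print(np.array_equal(arr2 + arr4, arr2 + arr4_replicated))   # True
```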

In [83]:
arr5 = np.array([7, 8, 9])

In [84]:
arr5.shape

(3,)

In [85]:
arr2 + arr5

ValueError: operands could not be broadcast together with shapes (3,4) (3,) 
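The addition fails because broadcasting compares shapes starting from the last dimension, and two lengths are compatible only if they are equal or one of them is 1. Here the last dimensions are 4 and 3, so `arr5` cannot be replicated to match `arr2`. If the intent is to add one value of `arr5` to each *row* of `arr2` (an assumption about what we want here), one possible fix is to reshape `arr5` into a column of shape `(3, 1)`:

```python
import numpy as np

arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])
arr5 = np.array([7, 8, 9])

# Shape (3, 1) is compatible with (3, 4): the length-1 dimension is stretched
arr5_column = arr5.reshape(3, 1)
result = arr2 + arr5_column

print(result)
# [[ 8  9 10 11]
#  [13 14 15 16]
#  [18 10 11 12]]
```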

### Linear Algebra

### Statistics & Aggregation

TODO 

- an example to motivate (linear model?)
- show how to create an array
- types of operations supported:
    - arithmetic
    - linear algebra
    - statistics
    - aggregation - max, predicate, reshape, stacking, splitting etc.
- differences between numpy over python
    - much faster
    - support

### Working with Numpy Arrays

While working with data, we often need to work with vectors and matrices. That's where numpy comes in.

Let's say we're trying to use weather data to predict the yield of apples in a region. We might model this relationship using a simple linear equation.

```
yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity
```

We're expressing the yield of apples as a weighted sum of the temperature, rainfall and humidity. Obviously, this isn't a perfect relationship.

Let's say we've come up with a set of weights `w1 = 0.2`, `w2 = 0.3`, `w3 = 0.4` after some research and experimentation. The first step for any such model would be back testing. 


Let's say we have historical weather data for 5 regions:

![](https://i.imgur.com/6Ujttb4.png)

We can use the weights to calculate the predicted yields.


Flow:
- Represent as matrices & perform operations with loops
  - matrix multiplication as an exercise??
- Represent as arrays (also cover other ways, also cover indexing)
   - other types of arrays - zero, ones, random
- Show matrix operations (without loops)
- Aggregation (mean, max etc.)
- Broadcasting
- show an example with a large file of numbers (??)

Next thing you should do: Numpy 100
Assignment - 5 numpy functions + blog post