# Numpy




Suppose we want to use climate data such as temperature, rainfall, and humidity to determine whether a region is well suited for growing apples. A simple approach is to express the relationship between the annual yield of apples (tons per hectare) and the climatic conditions, namely the average temperature (in degrees Fahrenheit), rainfall (in millimeters), and average relative humidity (in percent), as a linear equation.

`yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights `w1`, `w2`, and `w3`:

In [1]:
w1, w2, w3 = 0.3, 0.2, 0.5

In [2]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

Taking the Kanto region as an example, the expected yield is:

In [3]:
kanto_yield_apples = kanto_temp * w1 + kanto_rainfall * w2 + kanto_humidity * w3
kanto_yield_apples

56.8

In [4]:
print("The expected yield of apples in Kanto region is {} tons per hectare.".format(kanto_yield_apples))

The expected yield of apples in Kanto region is 56.8 tons per hectare.


For easier calculations, let's convert the given data into lists

In [5]:
kanto = [73, 67, 43]
johto = [91, 88, 64]
hoenn = [87, 134, 58]
sinnoh = [102, 43, 37]
unova = [69, 96, 70]

The three numbers in each list (which we can treat as a vector) represent the temperature, rainfall, and humidity data, respectively.

We can also represent the set of weights used in the formula as a vector.

In [6]:
weights = [w1, w2, w3]

We can now write a function `crop_yield` to calculate the yield of apples (or any other crop) given the climate data and the respective weights.

In [7]:
def crop_yield(region, weights):
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result

In [8]:
crop_yield(kanto, weights)

56.8

In [9]:
crop_yield(johto, weights)

76.9

In [10]:
crop_yield(unova, weights)

74.9

Next, let's import the `numpy` module. It's common practice to import numpy with the alias `np`.

In [11]:
import numpy as np

In [12]:
kanto = np.array([73, 67, 43])

In [13]:
kanto

array([73, 67, 43])

In [14]:
weights = np.array([w1, w2, w3])

In [15]:
weights

array([0.3, 0.2, 0.5])

Numpy arrays have the type `ndarray`.

In [16]:
type(kanto)

numpy.ndarray

Just like lists, Numpy arrays support the indexing notation `[]`.

In [17]:
weights[0]

0.3

In [18]:
kanto[2]

43

## Operating on Numpy arrays

We can compute the dot product of the two vectors using the `np.dot` function.

In [19]:
np.dot(kanto, weights)

56.8

In [20]:
(kanto * weights).sum()

56.8

The `*` operator performs an element-wise multiplication of two arrays if they have the same size. The `sum` method calculates the sum of numbers in an array.

In [21]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

In [22]:
arr1 * arr2

array([ 4, 10, 18])

In [23]:
arr2.sum()

15

## Benefits of using Numpy arrays

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `crop_yield`.
- **Performance**: Numpy operations and functions are implemented internally in compiled code (largely C), which makes them much faster than Python statements and loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.

In [24]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

The `%time` and `%%time` magic commands (available in Jupyter/IPython) report the execution time of a single statement or an entire cell, respectively.

`zip()` is a built-in Python function that is used to combine two or more iterable objects (e.g., lists, tuples, or other sequences) element-wise, creating an iterator that generates tuples containing elements from the input iterables at the same position. It continues until the shortest input iterable is exhausted.
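
As a small illustration of the "shortest iterable" behaviour described above, here is a minimal sketch. The lists `labels` and `temps` are made up for this example and are not part of the climate data.

```python
# Pair up elements from two lists of different lengths.
labels = ["kanto", "johto"]   # two values
temps = [73, 91, 87]          # three values

# zip stops as soon as the shorter list runs out, so 87 is dropped silently.
pairs = list(zip(labels, temps))
print(pairs)  # [('kanto', 73), ('johto', 91)]
```

Because the extra element is dropped without any error, it is worth checking that the lists passed to a function like `crop_yield` have the same length.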

In [25]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

CPU times: user 443 ms, sys: 1.21 ms, total: 444 ms
Wall time: 455 ms


833332333333500000

In [26]:
%%time
np.dot(arr1_np, arr2_np)

CPU times: user 1.58 ms, sys: 0 ns, total: 1.58 ms
Wall time: 1.59 ms


833332333333500000

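As you can see, `np.dot` is well over 100 times faster than the `for` loop in this run (about 1.6 ms versus 455 ms of wall time); the exact numbers will vary from machine to machine. This makes Numpy especially useful when working with really large datasets with tens of thousands or millions of data points.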
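
The `%%time` magic only works inside Jupyter/IPython. As a rough sketch of the same comparison in a plain Python script, one could use `time.perf_counter` from the standard library. The variable names and the smaller vector size below are assumptions made for this example, and the timings it prints will differ between machines.

```python
import time
import numpy as np

n = 100_000  # smaller than the one million used above, to keep the run short
a = list(range(n))
b = list(range(n, 2 * n))
a_np = np.array(a, dtype=np.int64)  # explicit 64-bit integers to avoid overflow
b_np = np.array(b, dtype=np.int64)

# Time the pure-Python loop.
start = time.perf_counter()
loop_result = 0
for x1, x2 in zip(a, b):
    loop_result += x1 * x2
loop_seconds = time.perf_counter() - start

# Time the Numpy dot product.
start = time.perf_counter()
dot_result = int(np.dot(a_np, b_np))
dot_seconds = time.perf_counter() - start

print(loop_result == dot_result)  # both approaches give the same answer
print(f"loop: {loop_seconds:.6f} s, np.dot: {dot_seconds:.6f} s")
```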


## Multi-dimensional Numpy arrays

We can represent the climate data for all the regions using a single 2-dimensional Numpy array.

In [27]:
climate_data = np.array([[73, 67, 43],
                         [91, 88, 64],
                         [87, 134, 58],
                         [102, 43, 37],
                         [69, 96, 70]])

In [28]:
climate_data

array([[ 73,  67,  43],
       [ 91,  88,  64],
       [ 87, 134,  58],
       [102,  43,  37],
       [ 69,  96,  70]])

The 2-d array shown above is a matrix with five rows and three columns. Each row represents one region, and the columns represent temperature, rainfall, and humidity, respectively.

Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the `.shape` property of an array.


In [29]:
# 2D array (matrix)
climate_data.shape

(5, 3)

In [30]:
weights

array([0.3, 0.2, 0.5])

In [31]:
# 1D array (vector)
weights.shape

(3,)

In [32]:
# 3D array
arr3 = np.array([
    [[11, 12, 13],
     [13, 14, 15]],
    [[15, 16, 17],
     [17, 18, 19.5]]])

In [33]:
arr3.shape

(2, 2, 3)
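
To connect `.shape` with the other size-related attributes, here is a short sketch. The array `grid` is just a copy of the 3D example above under a different (hypothetical) name; `.ndim` and `.size` are standard `ndarray` attributes.

```python
import numpy as np

grid = np.array([[[11, 12, 13],
                  [13, 14, 15]],
                 [[15, 16, 17],
                  [17, 18, 19.5]]])

print(grid.shape)  # (2, 2, 3): length along each dimension
print(grid.ndim)   # 3: number of dimensions, same as len(grid.shape)
print(grid.size)   # 12: total number of elements, 2 * 2 * 3
print(len(grid))   # 2: length along the first dimension only
```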

All the elements in a Numpy array have the same data type. The data type of an array can be checked using the `.dtype` property.

In [34]:
weights.dtype

dtype('float64')

In [35]:
climate_data.dtype

dtype('int64')

If an array contains even a single floating point number, all the other elements are also converted to floats.
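
A minimal sketch of this conversion, using two small made-up arrays (`ints` and `mixed` are illustrative names only):

```python
import numpy as np

ints = np.array([1, 2, 3])       # only integers -> integer dtype
mixed = np.array([1, 2, 3.5])    # one float -> every element becomes a float

print(np.issubdtype(ints.dtype, np.integer))  # True
print(mixed.dtype)                            # float64

# An explicit conversion is also possible with astype.
as_float = ints.astype(np.float64)
print(as_float.dtype)                         # float64
```

The exact integer type (`int64` or `int32`) can depend on the platform and Numpy version, which is why the sketch checks for "some integer type" rather than a specific one.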

The `np.matmul` function or the `@` operator can be used to perform matrix multiplication.

In [36]:
np.matmul(climate_data, weights)

array([56.8, 76.9, 81.9, 57.7, 74.9])

In [37]:
climate_data @ weights

array([56.8, 76.9, 81.9, 57.7, 74.9])
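
One way to read the result above: multiplying the matrix by the weights vector computes one dot product per row. The sketch below checks this on the first two regions; the names `climate`, `w`, `per_row`, and `all_at_once` are introduced only for this example.

```python
import numpy as np

climate = np.array([[73, 67, 43],
                    [91, 88, 64]])
w = np.array([0.3, 0.2, 0.5])

# One dot product per row, computed with an explicit loop...
per_row = np.array([np.dot(row, w) for row in climate])
# ...and the same thing in a single matrix-vector multiplication.
all_at_once = climate @ w

print(np.allclose(per_row, all_at_once))  # True
print(all_at_once.shape)                  # (2,)
```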

## Working with CSV data files

Numpy also provides helper functions for reading from and writing to files. The file `climate.txt` contains 10,000 climate measurements (temperature, rainfall, and humidity) in the following format:


```
temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
...
```

This format of storing data is known as *comma-separated values* or CSV.

> **CSVs**: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)


To read this file into a numpy array, the `genfromtxt` function is used.

In [38]:
import urllib.request
# urllib.request.urlretrieve(url, filename) downloads the file at `url` and saves it locally as `filename`
urllib.request.urlretrieve(
    'https://gist.github.com/BirajCoder/a4ffcb76fd6fb221d76ac2ee2b8584e9/raw/4054f90adfd361b7aa4255e99c2e874664094cea/climate.csv',
    'climate.txt')

('climate.txt', <http.client.HTTPMessage at 0x7b2da4c0b850>)

In [39]:
import numpy as np
# arguments: file name, field separator (delimiter), number of header lines to skip
climate_data = np.genfromtxt('climate.txt', delimiter=',', skip_header=1)

In [40]:
climate_data

array([[25., 76., 99.],
       [39., 65., 70.],
       [59., 45., 77.],
       ...,
       [99., 62., 58.],
       [70., 71., 91.],
       [92., 39., 76.]])

In [41]:
climate_data.shape

(10000, 3)
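
If the download is not available, the same call can be tried on a few lines of in-memory text. This is only a sketch: `csv_text` reuses the three sample rows shown earlier, and `io.StringIO` stands in for the real file.

```python
from io import StringIO
import numpy as np

# A tiny stand-in for climate.txt, using the sample rows shown above.
csv_text = """temperature,rainfall,humidity
25.00,76.00,99.00
39.00,65.00,70.00
59.00,45.00,77.00
"""

# skip_header=1 drops the line with the column names before parsing numbers.
sample = np.genfromtxt(StringIO(csv_text), delimiter=",", skip_header=1)

print(sample.shape)  # (3, 3)
print(sample.dtype)  # float64
```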

In [42]:
weights = np.array([0.3, 0.2, 0.5])

In [43]:
yields = climate_data @ weights

In [44]:
yields

array([72.2, 59.7, 65.2, ..., 71.1, 80.7, 73.4])

In [45]:
yields.shape

(10000,)

Let's add the `yields` to `climate_data` as a fourth column using the [`np.concatenate`](https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html) function.

In [46]:
climate_results = np.concatenate((climate_data, yields.reshape(10000, 1)), axis=1)

In NumPy, the `.reshape()` method is used to change the shape (dimensions) of a NumPy array without changing its data. This method allows you to reorganize the elements of an array into a different shape while ensuring that the total number of elements remains the same.

In [47]:
climate_results

array([[25. , 76. , 99. , 72.2],
       [39. , 65. , 70. , 59.7],
       [59. , 45. , 77. , 65.2],
       ...,
       [99. , 62. , 58. , 71.1],
       [70. , 71. , 91. , 80.7],
       [92. , 39. , 76. , 73.4]])

* Since we wish to add new columns, we pass the argument `axis=1` to `np.concatenate`. The `axis` argument specifies the dimension along which the arrays are joined.

* The arrays should have the same number of dimensions and the same length along every dimension except the one used for concatenation. That is why the `.reshape` method (the method form of the [`np.reshape`](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html) function) is used to change the shape of `yields` from `(10000,)` to `(10000, 1)`.
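
The two points above can be checked on a much smaller array. In this sketch, `data`, `extra`, `column`, and `combined` are hypothetical names, and `reshape(-1, 1)` is used so that Numpy infers the number of rows instead of hard-coding it.

```python
import numpy as np

data = np.array([[25., 76., 99.],
                 [39., 65., 70.]])   # shape (2, 3)
extra = np.array([72.2, 59.7])       # shape (2,)

# A 1D array cannot be joined to a 2D array directly.
try:
    np.concatenate((data, extra), axis=1)
    direct_concat_failed = False
except ValueError:
    direct_concat_failed = True

# Turn the 1D array into a single column; -1 lets Numpy infer the row count.
column = extra.reshape(-1, 1)
combined = np.concatenate((data, column), axis=1)

print(direct_concat_failed)  # True
print(column.shape)          # (2, 1)
print(combined.shape)        # (2, 4)
```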


In [48]:
np.savetxt('climate_results.txt',  # filename
           climate_results,  # what to write
           fmt='%.2f',  # formatting -> two decimal places for every number
           delimiter=',',  # separator between values
           header='temperature,rainfall,humidity,yield_apples',  # first line of the file (written with a leading '# ' by default)
           )
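
To see what `np.savetxt` actually writes, here is a small round-trip sketch. It writes two made-up rows to a temporary file (the file name `demo_results.txt` and the variable names are assumptions for this example) and reads them back with `np.loadtxt`.

```python
import os
import tempfile
import numpy as np

results = np.array([[25., 76., 99., 72.2],
                    [39., 65., 70., 59.7]])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo_results.txt")
    np.savetxt(path, results, fmt="%.2f", delimiter=",",
               header="temperature,rainfall,humidity,yield_apples")

    # Look at the header line exactly as it was written.
    with open(path) as f:
        first_line = f.readline().rstrip("\n")

    # np.loadtxt skips lines starting with '#', so the header is ignored.
    loaded = np.loadtxt(path, delimiter=",")

print(first_line)                    # # temperature,rainfall,humidity,yield_apples
print(np.allclose(loaded, results))  # True
```

By default the header is prefixed with `# ` (the `comments` parameter of `np.savetxt`), which marks it as a comment line for readers such as `np.loadtxt`.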

## Arithmetic operations, broadcasting and comparison

Numpy arrays support arithmetic operators like `+`, `-`, `*`, etc. You can perform an arithmetic operation with a single number (also called scalar) or with another array of the same shape. Operators make it easy to write mathematical expressions with multi-dimensional arrays.

In [49]:
arr2 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 1, 2, 3]])

In [50]:
arr3 = np.array([[11, 12, 13, 5],
                 [14, 15, 16, 6],
                 [17, 18, 19, 7]])

In [51]:
arr2 + arr3

array([[12, 14, 16,  9],
       [19, 21, 23, 14],
       [26, 19, 21, 10]])

In [52]:
# adding a scalar
arr2 + 8

array([[ 9, 10, 11, 12],
       [13, 14, 15, 16],
       [17,  9, 10, 11]])

In [53]:
# dividing by a scalar
arr3 / 2

array([[5.5, 6. , 6.5, 2.5],
       [7. , 7.5, 8. , 3. ],
       [8.5, 9. , 9.5, 3.5]])

In [54]:
arr2 * arr3

array([[ 11,  24,  39,  20],
       [ 70,  90, 112,  48],
       [153,  18,  38,  21]])

In [55]:
# modulus with a scalar
arr2 % 4

array([[1, 2, 3, 0],
       [1, 2, 3, 0],
       [1, 1, 2, 3]])

### Array Broadcasting

Numpy arrays also support *broadcasting*, allowing arithmetic operations between two arrays with different numbers of dimensions but compatible shapes. Let's look at an example to see how it works.

In [56]:
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 1, 2, 3]])

In [57]:
arr2.shape

(3, 4)

In [58]:
arr4 = np.array([4, 5, 6, 7])

In [59]:
arr4.shape

(4,)

In [60]:
arr2 + arr4

array([[ 5,  7,  9, 11],
       [ 9, 11, 13, 15],
       [13,  6,  8, 10]])

When the expression `arr2 + arr4` is evaluated, `arr4` (which has the shape `(4,)`) is replicated three times to match the shape `(3, 4)` of `arr2`. Numpy performs the replication without actually creating three copies of the smaller array, which improves performance and uses less memory.

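Broadcasting only works if one of the arrays can be replicated to match the other array's shape: comparing the two shapes from the last dimension backwards, each pair of lengths must either be equal or one of them must be 1.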
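
Here is a short sketch of shapes that can and cannot be broadcast against a `(3, 4)` array. The arrays `row`, `col`, and `bad` are made up for this example.

```python
import numpy as np

a = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 1, 2, 3]])          # shape (3, 4)
row = np.array([4, 5, 6, 7])          # shape (4,): added to every row
col = np.array([[10], [20], [30]])    # shape (3, 1): added to every column
bad = np.array([1, 2, 3])             # shape (3,): last dimension 3 does not match 4

print((a + row).shape)  # (3, 4)
print((a + col).shape)  # (3, 4)

try:
    a + bad
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("cannot broadcast:", err)
```

Reshaping `bad` to `(3, 1)` with `bad.reshape(3, 1)` would make it compatible, because its last dimension would then have length 1.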

### Array Comparison

Numpy arrays also support comparison operations like `==`, `!=`, `>` etc. The result is an array of booleans.

In [61]:
arr1 = np.array([[1, 2, 3], [3, 4, 5]])
arr2 = np.array([[2, 2, 3], [1, 2, 5]])

In [62]:
arr1 == arr2

array([[False,  True,  True],
       [False, False,  True]])

In [63]:
arr1 != arr2

array([[ True, False, False],
       [ True,  True, False]])

In [64]:
arr1 > arr2

array([[False, False, False],
       [ True,  True, False]])

In [65]:
arr1 >= arr2

array([[False,  True,  True],
       [ True,  True,  True]])

Array comparison is frequently used to count the number of equal elements in two arrays using the `sum` method. Remember that `True` evaluates to `1` and `False` evaluates to `0` when booleans are used in arithmetic operations.

In [66]:
(arr1 == arr2).sum() # gives the number of matching elements in arr1 and arr2

3
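
The same idea extends from counting matches to computing the fraction of matching elements. In this sketch, `x`, `y`, and `matches` are illustrative names for copies of the arrays compared above.

```python
import numpy as np

x = np.array([[1, 2, 3], [3, 4, 5]])
y = np.array([[2, 2, 3], [1, 2, 5]])

matches = (x == y)       # array of booleans

print(True + True)       # 2: booleans behave like 1 and 0 in arithmetic
print(matches.sum())     # 3: number of matching elements
print(matches.mean())    # 0.5: fraction of matching elements (3 out of 6)
```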

## Array indexing and slicing

Numpy extends Python's list indexing notation using `[]` to multiple dimensions in an intuitive fashion. You can provide a comma-separated list of indices or ranges to select a specific element or a subarray (also called a slice) from a Numpy array.

In [67]:
arr3 = np.array([
    [[11, 12, 13, 14],
     [13, 14, 15, 19]],

    [[15, 16, 17, 21],
     [63, 92, 36, 18]],

    [[98, 32, 81, 23],
     [17, 18, 19.5, 43]]])

In [68]:
arr3.shape

(3, 2, 4)

In [69]:
# Selecting a 2D subarray with a single index
arr3[1]  # note: indexing starts from 0

array([[15., 16., 17., 21.],
       [63., 92., 36., 18.]])

In [70]:
# Selecting a 1D subarray with two indices
arr3[1,1]

array([63., 92., 36., 18.])

In [71]:
# Selecting a single element with three indices
arr3[1,1,2]

36.0

In [72]:
# Subarray using ranges
# one range per dimension: outermost array, middle array, innermost array
arr3[1:, 0:1, :2]

array([[[15., 16.]],

       [[98., 32.]]])

In [73]:
arr3[1:, 0:1, :2].shape

(2, 1, 2)

In [74]:
# mixing indices and ranges
arr3[1:, 1, 3]

array([18., 43.])
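
A useful way to predict the shape of a result: each plain index removes a dimension, while each range keeps it (possibly with length 1). The sketch below uses a hypothetical array `cube` with the same shape as `arr3`.

```python
import numpy as np

cube = np.arange(24).reshape(3, 2, 4)   # same shape as arr3: (3, 2, 4)

print(cube[1].shape)              # (2, 4): an index drops the first dimension
print(cube[1:2].shape)            # (1, 2, 4): a range keeps it with length 1
print(cube[1:, 1, 3].shape)       # (2,): two indices drop two dimensions
print(cube[1:, 1:2, 3:4].shape)   # (2, 1, 1): three ranges keep all three
print(cube[1:, 1, 3].tolist())    # [15, 23]
```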

In [75]:
arr3[1]

array([[15., 16., 17., 21.],
       [63., 92., 36., 18.]])

## Other ways of creating Numpy arrays

Numpy also provides functions for creating arrays of a given shape filled with fixed or random values.

In [76]:
# All zeros (also called a null or zero matrix)
np.zeros((3, 2))  # (3, 2) is the shape of the array we want to create

array([[0., 0.],
       [0., 0.],
       [0., 0.]])

In [77]:
# All ones
np.ones([2, 2, 3])  # the shape can be given as a tuple or a list

array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [78]:
# Identity Matrix
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [79]:
# Random Vector
np.random.rand(6)

array([0.26935326, 0.07051884, 0.59924261, 0.19968322, 0.96219707,
       0.05640275])

In [80]:
# Random Matrix
np.random.randn(2, 3)

array([[ 1.28986161,  0.59865669,  1.65078895],
       [-0.22050317,  1.24124581, -1.76047901]])

`np.random.rand()` generates random values uniformly distributed in the range [0, 1), while `np.random.randn()` generates random values following a standard normal distribution (mean 0, standard deviation 1).
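
The difference between the two distributions shows up in simple summary statistics. The sketch below draws a large sample from each; the seed value and sample size are arbitrary choices for this example, and the printed statistics are only approximately 0.5, 0, and 1.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

uniform = np.random.rand(100_000)   # uniform on [0, 1)
normal = np.random.randn(100_000)   # standard normal

print(uniform.min() >= 0.0, uniform.max() < 1.0)  # True True
print(round(uniform.mean(), 2))                   # close to 0.5
print(round(normal.mean(), 2))                    # close to 0
print(round(normal.std(), 2))                     # close to 1
print(normal.min() < 0)                           # True: randn also produces negative values
```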

In [81]:
# Fixed Value
np.full([2, 3], 42)

array([[42, 42, 42],
       [42, 42, 42]])

In [82]:
# Range with start, end, step
np.arange(10, 90, 3)

array([10, 13, 16, 19, 22, 25, 28, 31, 34, 37, 40, 43, 46, 49, 52, 55, 58,
       61, 64, 67, 70, 73, 76, 79, 82, 85, 88])

In [83]:
np.arange(10, 90, 3).shape

(27,)

In [84]:
np.arange(10, 90, 3).reshape(3, 3, 3)  # reshaping the 27 values above into a 3D array of shape (3, 3, 3)

array([[[10, 13, 16],
        [19, 22, 25],
        [28, 31, 34]],

       [[37, 40, 43],
        [46, 49, 52],
        [55, 58, 61]],

       [[64, 67, 70],
        [73, 76, 79],
        [82, 85, 88]]])

In [85]:
# Equally spaced numbers in a range
np.linspace(3, 27, 9) # start, end (inclusive), total number of values

array([ 3.,  6.,  9., 12., 15., 18., 21., 24., 27.])
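
`np.arange` and `np.linspace` differ in how the end of the range is treated and in which quantity you specify (step size versus number of values). A minimal sketch, with `stepped` and `spaced` as illustrative names:

```python
import numpy as np

stepped = np.arange(3, 27, 3)    # start, end (exclusive), step
spaced = np.linspace(3, 27, 9)   # start, end (inclusive), number of values

print(stepped.tolist())          # [3, 6, 9, 12, 15, 18, 21, 24]: 27 is not included
print(len(spaced))               # 9
print(spaced[-1])                # 27.0: the end point is included
print(spaced[1] - spaced[0])     # 3.0: spacing is (27 - 3) / (9 - 1)
```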