## Working with numerical data

The "data" in *Data Analysis* typically refers to numerical data, such as stock prices, sales figures, sensor measurements, sports scores, and database tables. The [Numpy](https://numpy.org) library provides specialized data structures, functions, and other tools for numerical computing in Python. Let's work through an example to see why and how to use Numpy for working with numerical data.


> Suppose we want to use climate data like the temperature, rainfall, and humidity to determine if a region is well suited for growing apples. A simple approach would be to express the relationship between the annual yield of apples (tons per hectare) and climatic conditions such as the average temperature (in degrees Fahrenheit), rainfall (in millimeters), and average relative humidity (as a percentage) as a linear equation.
>
> `yield_of_apples = w1 * temperature + w2 * rainfall + w3 * humidity`

We're expressing the yield of apples as a weighted sum of the temperature, rainfall, and humidity. This equation is an approximation since the actual relationship may not necessarily be linear, and there may be other factors involved. But a simple linear model like this often works well in practice.

Based on some statistical analysis of historical data, we might come up with reasonable values for the weights `w1`, `w2`, and `w3`. Here's an example set of values:

In [1]:
w1, w2, w3  = 0.3, 0.2, 0.5

Given some climate data for a region, we can now predict the yield of apples. Here's some sample data:

<img src="https://i.imgur.com/TXPBiqv.png" style="width:360px;">

To begin, we can define some variables to record climate data for a region.

In [2]:
kanto_temp = 73
kanto_rainfall = 67
kanto_humidity = 43

In [3]:
kanto_yield_apples = kanto_temp*w1 + kanto_rainfall*w2 + kanto_humidity*w3

In [4]:
kanto_yield_apples

56.8

In [5]:
print('The expected yield of apples in the Kanto region is {} tonnes per hectare.'.format(kanto_yield_apples))

The expected yield of apples in the Kanto region is 56.8 tonnes per hectare.


To make it easier to perform the above computation for multiple regions, we can represent the climate data for each region as a vector, i.e., a list of numbers.

In [14]:
kanto = [73,67,43]
johto = [91,88,64]
hoenn = [87,134,58]
sinnoh = [102,43,37]
unova = [69,96,70]

The three numbers in each vector represent the temperature, rainfall, and humidity data, respectively. 

We can also represent the set of weights used in the formula as a vector.

In [17]:
weights= [w1,w2,w3]

We can now write a function `crop_yield` that calculates the yield of apples for a region, given its climate data and the weights.

In [20]:
def crop_yield(region, weights):
    result = 0
    for x,w in zip(region, weights):
        result += x*w
    return result

In [21]:
crop_yield(kanto, weights)

56.8

In [22]:
crop_yield(johto, weights)

76.9

In [23]:
crop_yield(unova, weights)

74.9
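
Rather than calling `crop_yield` once per region, the same function can be applied to all five regions in one pass. The following is a minimal sketch under a few assumptions: the `regions` dictionary and the `yields` name are introduced here purely for illustration, and the data and weights are copied from the cells above.

```python
# Climate data per region: [temperature (°F), rainfall (mm), humidity (%)]
# `regions` is a hypothetical container used only for this illustration.
regions = {
    "kanto": [73, 67, 43],
    "johto": [91, 88, 64],
    "hoenn": [87, 134, 58],
    "sinnoh": [102, 43, 37],
    "unova": [69, 96, 70],
}
weights = [0.3, 0.2, 0.5]  # w1, w2, w3 from above


def crop_yield(region, weights):
    """Return the weighted sum of a region's climate values."""
    result = 0
    for x, w in zip(region, weights):
        result += x * w
    return result


# Compute the predicted yield for every region with a dict comprehension.
yields = {name: crop_yield(data, weights) for name, data in regions.items()}

for name, value in yields.items():
    print(f"{name}: {value:.1f} tons per hectare")
```

This still loops over the numbers one at a time in Python, which is the part Numpy lets us avoid.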

## Going from Python lists to Numpy arrays


The calculation performed by `crop_yield` (element-wise multiplication of two vectors, followed by a sum of the results) is also called the *dot product*.
The Numpy library provides a built-in function to compute the dot product of two vectors. However, we must first convert the lists into Numpy arrays.

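Let's import the Numpy library. If it isn't installed yet, you can install it first using the `pip` package manager (`pip install numpy`). By convention, the `numpy` module is imported with the alias `np`.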

In [24]:
import numpy as np

We can now use the `np.array` function to create Numpy arrays.

In [26]:
kanto = np.array([73,67,43])

In [27]:
kanto

array([73, 67, 43])

In [28]:
weights = np.array([w1,w2,w3])

In [29]:
weights

array([0.3, 0.2, 0.5])

In [30]:
type(kanto)

numpy.ndarray

In [31]:
type(weights)

numpy.ndarray

Just like lists, Numpy arrays support the indexing notation `[]`.

In [33]:
weights[0]

np.float64(0.3)

In [34]:
kanto[2]

np.int64(43)
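
Other list-style indexing works on one-dimensional arrays too. As a small sketch (the names `last` and `first_two` are illustrative, not part of the example above), negative indices and slices behave much as they do for lists, except that a slice gives back another Numpy array:

```python
import numpy as np

kanto = np.array([73, 67, 43])

# Negative indices count from the end, just as with lists.
last = kanto[-1]

# Slicing selects a range of elements and returns a Numpy array.
first_two = kanto[:2]

print(last)
print(first_two)
print(type(first_two))
```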

## Operating on Numpy arrays

We can now compute the dot product of the two vectors using the `np.dot` function.

In [35]:
np.dot(kanto, weights)

np.float64(56.8)

We can achieve the same result with lower-level operations supported by Numpy arrays: performing an element-wise multiplication and then summing the resulting numbers.

In [36]:
(kanto*weights).sum()

np.float64(56.8)

The `*` operator performs an element-wise multiplication of two arrays if they have the same shape. The `sum` method calculates the sum of the numbers in an array.

In [37]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

In [38]:
arr1*arr2

array([ 4, 10, 18])

In [39]:
arr2.sum()

np.int64(15)
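
To see why the conversion to arrays matters, it can help to try the same `*` operation where it does not apply. The sketch below (with illustrative names `list_error` and `shape_error`) shows two failure cases: plain Python lists do not support element-wise multiplication, and two one-dimensional arrays of different lengths cannot be multiplied element by element.

```python
import numpy as np

list1, list2 = [1, 2, 3], [4, 5, 6]

# Multiplying one list by another is not defined in Python.
try:
    list1 * list2
except TypeError as err:
    list_error = err
    print("lists:", err)

# Arrays of lengths 3 and 2 have incompatible shapes for `*`.
try:
    np.array([1, 2, 3]) * np.array([4, 5])
except ValueError as err:
    shape_error = err
    print("arrays:", err)
```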

## Benefits of using Numpy arrays

Numpy arrays offer the following benefits over Python lists for operating on numerical data:

- **Ease of use**: You can write small, concise, and intuitive mathematical expressions like `(kanto * weights).sum()` rather than using loops & custom functions like `crop_yield`.
- **Performance**: Numpy operations and functions are implemented internally in optimized, compiled code (largely C), which makes them much faster than equivalent Python statements and loops that are interpreted at runtime.

Here's a comparison of dot products performed using Python loops vs. Numpy arrays on two vectors with a million elements each.

In [41]:
# Python lists
arr1 = list(range(1000000))
arr2 = list(range(1000000, 2000000))

# Numpy arrays
arr1_np = np.array(arr1)
arr2_np = np.array(arr2)

In [42]:
%%time
result = 0
for x1, x2 in zip(arr1, arr2):
    result += x1*x2
result

CPU times: total: 172 ms
Wall time: 236 ms


833332333333500000

In [43]:
%%time
np.dot(arr1_np,arr2_np)

CPU times: total: 0 ns
Wall time: 3.9 ms


np.int64(833332333333500000)

In this run, `np.dot` is roughly 60 times faster than the `for` loop (3.9 ms vs. 236 ms of wall time); the exact speedup will vary from machine to machine. This makes Numpy especially useful when working with large datasets containing tens of thousands or millions of data points.
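
The `%%time` magic only works inside Jupyter or IPython. As a rough sketch of the same comparison in a plain Python script, the standard library's `time.perf_counter` can be used instead. The variable names below are illustrative, and `dtype=np.int64` is set explicitly as an assumption so that the large result does not overflow on platforms where the default integer type is 32-bit.

```python
import time

import numpy as np

n = 1_000_000

# Python lists
arr1 = list(range(n))
arr2 = list(range(n, 2 * n))

# Numpy arrays (64-bit integers, so the large sum fits)
arr1_np = np.array(arr1, dtype=np.int64)
arr2_np = np.array(arr2, dtype=np.int64)

# Time the pure-Python loop.
start = time.perf_counter()
loop_result = 0
for x1, x2 in zip(arr1, arr2):
    loop_result += x1 * x2
loop_seconds = time.perf_counter() - start

# Time the Numpy dot product.
start = time.perf_counter()
np_result = np.dot(arr1_np, arr2_np)
np_seconds = time.perf_counter() - start

print(f"loop:   {loop_seconds * 1000:.1f} ms")
print(f"np.dot: {np_seconds * 1000:.1f} ms")
print("same result:", loop_result == int(np_result))
```

A single timing like this is noisy, so the printed numbers are best read as an order-of-magnitude indication rather than a precise benchmark.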

Let's save our work before continuing.