## notes about vectorization
also called array programming. instead of applying an operation to one scalar at a time, we apply it to a whole array (a vector or a matrix) at once

we will be using numpy for quick inference. its basic type, `ndarray`, stands for *n-dimensional array*. linear algebra techniques then work as expected.

In [None]:
import numpy as np

a = np.array([1, 2, 3, 4, 5])

2 * a

array([ 2,  4,  6,  8, 10])

In [None]:
0.5 * a + 2

array([2.5, 3. , 3.5, 4. , 4.5])

In [None]:
# we can also create an array and then transform it to a matrix representation
a = np.arange(12).reshape(3, 4)
2 * a

array([[ 0,  2,  4,  6],
       [ 8, 10, 12, 14],
       [16, 18, 20, 22]])
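as a quick check of the "linear algebra works as expected" note, here is a minimal sketch using the same 3x4 matrix. the vector `v` is made up for illustration. note that `*` is element-wise, while `@` is the actual matrix product.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)   # same 3x4 matrix as in the cell above
v = np.array([1, 0, 1, 0])        # hypothetical vector, for illustration only

elementwise = m * m               # element-by-element product, shape (3, 4)
mat_vec = m @ v                   # matrix-vector product, shape (3,)
gram = m @ m.T                    # matrix-matrix product, shape (3, 3)

print(mat_vec)
print(gram.shape)
```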

In [None]:
a.mean(), np.mean(a)

(5.5, 5.5)

pandas equivalent

In [None]:
import pandas as pd

cols = ['a', 'b', 'c']
data = np.arange(15).reshape(5, 3)


rows = pd.date_range('2021-07-01', periods=5, freq='B') # generates 5 business days
rows

DatetimeIndex(['2021-07-01', '2021-07-02', '2021-07-05', '2021-07-06',
               '2021-07-07'],
              dtype='datetime64[ns]', freq='B')

In [None]:
df = pd.DataFrame(data, columns=cols, index=rows)
df

             a   b   c
2021-07-01   0   1   2
2021-07-02   3   4   5
2021-07-05   6   7   8
2021-07-06   9  10  11
2021-07-07  12  13  14


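vectorization works pretty much the same as in `numpy`. however, aggregations such as `sum` and `mean` run column-wise by default, whereas numpy's `mean` above reduced the whole array to a single number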

In [None]:
2 * df # multiply all elements by 2

             a   b   c
2021-07-01   0   2   4
2021-07-02   6   8  10
2021-07-05  12  14  16
2021-07-06  18  20  22
2021-07-07  24  26  28


In [None]:
df.sum(), df.mean()

(a    30
 b    35
 c    40
 dtype: int64,
 a    6.0
 b    7.0
 c    8.0
 dtype: float64)
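to make the column-wise default concrete, a small sketch: the same data as above but with a plain integer index (an assumption, just to keep it short). `axis=1` switches the aggregation to row-wise, and going through `to_numpy()` gives the numpy-style reduction over every element.

```python
import numpy as np
import pandas as pd

# same values as df above, but without the date index (kept short on purpose)
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['a', 'b', 'c'])

col_sums = df.sum()           # default axis=0: one value per column
row_sums = df.sum(axis=1)     # axis=1: one value per row
total = df.to_numpy().sum()   # numpy default: reduce over every element

print(col_sums.tolist(), row_sums.tolist(), total)
```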

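so each column can then be treated as a vector. we can access columns with `df["col_name"]` notation or as an attribute, `df.col_name` (the attribute form only works when the name is a valid python identifier and does not clash with an existing DataFrame attribute)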

In [None]:
df["a"] + df["b"] # add column a to column b

2021-07-01     1
2021-07-02     7
2021-07-05    13
2021-07-06    19
2021-07-07    25
Freq: B, dtype: int64

In [None]:
0.5 * df.a + 2 * df.b - df.c

2021-07-01     0.0
2021-07-02     4.5
2021-07-05     9.0
2021-07-06    13.5
2021-07-07    18.0
Freq: B, dtype: float64

a comparison on a column returns a boolean series, which we can then use for logical (boolean) indexing

In [None]:
df[df['a'] > 5] # select rows where column a is greater than 5

             a   b   c
2021-07-05   6   7   8
2021-07-06   9  10  11
2021-07-07  12  13  14


In [None]:
# or similarly a comparison between two columns
0.15 * df.a + df.b > df.c

2021-07-01    False
2021-07-02    False
2021-07-05    False
2021-07-06     True
2021-07-07     True
Freq: B, dtype: bool
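the boolean series above can be fed straight back into `df[...]`. a minimal sketch, again with a plain integer index (an assumption to keep it short). the second condition, `c < 14`, is made up for illustration. conditions are combined with `&` / `|` rather than `and` / `or`, and each one needs its own parentheses.

```python
import numpy as np
import pandas as pd

# same values as df above, but without the date index (kept short on purpose)
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['a', 'b', 'c'])

mask = 0.15 * df.a + df.b > df.c   # boolean series, one flag per row
selected = df[mask]                # keep only rows where the mask is True

# hypothetical second condition, combined element-wise with &
both = df[(df['a'] > 5) & (df['c'] < 14)]

print(selected.index.tolist(), both.index.tolist())
```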

the main advantage of all this is that the looping over these array operations happens in C-optimized code instead of in slower python-level for loops
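a rough way to see the difference is to time both versions. this is only a sketch: `scale_loop` and `scale_vectorized` are names made up here, the array size is arbitrary, and the actual timings depend on the machine, so no numbers are quoted.

```python
import timeit
import numpy as np

values = np.arange(100_000, dtype=np.float64)

def scale_loop(x):
    # explicit python loop: one interpreted iteration per element
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = 0.5 * x[i] + 2
    return out

def scale_vectorized(x):
    # same arithmetic; the loop runs inside numpy's compiled code
    return 0.5 * x + 2

# both versions should agree before comparing speed
assert np.allclose(scale_loop(values), scale_vectorized(values))

loop_time = timeit.timeit(lambda: scale_loop(values), number=3)
vec_time = timeit.timeit(lambda: scale_vectorized(values), number=3)
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```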