<a href="https://colab.research.google.com/github/letianzj/QuantResearch/blob/master/notebooks/python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Some advanced Python topics

- [Numpy](#numpy)
- [Pandas](#pandas)
- [Other](#other)
- [Reference](#reference)

In [7]:
%matplotlib inline
import numpy as np
import pandas as pd

## Numpy <a name="numpy"></a>


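A Python integer is more than a raw machine integer. Every integer is a full object: besides the value itself, it carries a reference count, a type pointer, and size information (the `PyObject_HEAD` part in the figure below).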

![image](https://github.com/letianzj/QuantResearch/blob/master/notebooks/img/cint_vs_pyint.png?raw=1)
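
The overhead can be observed directly. The snippet below is a minimal sketch; the variable names are illustrative, and the exact size of a Python integer depends on the interpreter build, so no specific value is assumed.

```python
import sys
import numpy as np

# Size of a full Python int object (header plus value). The exact number
# depends on the interpreter build, so it is only compared, not hard-coded.
py_int_bytes = sys.getsizeof(1)

# Size of one raw int32 element stored inside a NumPy array.
np_item_bytes = np.dtype(np.int32).itemsize

print(py_int_bytes, np_item_bytes)
```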


Similarly, a Python list is more than just a list. Its items field is a pointer to a block of pointers, each of which in turn points to a full Python object such as the Python integer above. This is what makes the list dynamically typed: each element can be of a different type. In contrast, a NumPy array is of fixed type: a group of homogeneous data is stored in one contiguous block of memory.


![image](https://github.com/letianzj/QuantResearch/blob/master/notebooks/img/array_vs_list.png?raw=1)
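
As a small illustration of the difference (a sketch with made-up values, not part of the original notebook), a list can hold objects of different types, while an array reports a single dtype and a contiguous buffer:

```python
import numpy as np

# A list stores references, so each element may have its own type.
mixed_list = [1, "two", 3.0]
element_types = [type(item).__name__ for item in mixed_list]

# An array stores raw values of one dtype in a single block of memory.
arr = np.array([1, 2, 3], dtype=np.int64)

print(element_types)
print(arr.dtype, arr.itemsize, arr.nbytes, arr.flags["C_CONTIGUOUS"])
```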


- [Basic Attributes](#numpy_attributes)
- [Indexing and Slicing](#numpy_indexing)
- [ufunc](#numpy_ufunc)

### Basic Attributes <a name="numpy_attributes"></a>

In [20]:
x = np.full((3, 2), [1, 2], dtype=np.int32)
x

array([[1, 2],
       [1, 2],
       [1, 2]], dtype=int32)

In [21]:
# 3 rows and 2 columns; size = 3x2 = 6 items; each int32 item takes 4 bytes (4x8 = 32 bits)
# strides: jump 2 items (2x4 = 8 bytes) to reach the next row; jump 1 item (4 bytes) to reach the next column
print(x.shape, x.size, x.itemsize, x.strides)
print(x.__array_interface__['data'])         # memory address

(3, 2) 6 4 (8, 4)
(27744320, False)
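
The stride arithmetic can be checked by hand. The sketch below uses a hypothetical array with distinct values (so the looked-up element is unambiguous) and reads one element straight from the raw bytes at the offset computed from the strides.

```python
import numpy as np

# Distinct values make it obvious which element was read.
x = np.arange(6, dtype=np.int32).reshape(3, 2)

# Byte offset of element (i, j) = i * row stride + j * column stride.
i, j = 2, 1
offset = i * x.strides[0] + j * x.strides[1]

# Read a single int32 from the raw buffer at that offset.
raw = x.tobytes()
value = np.frombuffer(raw, dtype=np.int32, count=1, offset=offset)[0]

print(offset, value, x[i, j])
```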


In [22]:
# Reshaping a contiguous array does not make a copy; it returns a view with altered shape and strides
y = x.reshape(2,3)
print(y.shape, y.size, y.itemsize, y.strides)
print(y.__array_interface__['data'])
# The same holds for a transpose
print((x.T).__array_interface__['data'])

(2, 3) 6 4 (12, 4)
(27744320, False)
(27744320, False)


In [23]:
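# NumPy is forced to copy when the new shape cannot be expressed as strides over the original buffer,
# e.g. when flattening a transposed (non-contiguous) array.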
xt_flat = (x.T).reshape(6,)
print(xt_flat.shape, xt_flat.size, xt_flat.itemsize, xt_flat.strides)
print(xt_flat.__array_interface__['data'])

(6,) 6 4 (4,)
(25212672, False)
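
Instead of comparing raw addresses, `np.shares_memory` gives the same answer directly. A minimal sketch, with illustrative variable names:

```python
import numpy as np

a = np.full((3, 2), [1, 2], dtype=np.int32)

view = a.reshape(2, 3)      # contiguous reshape: a view on the same buffer
transposed = a.T            # transpose: also a view, with swapped strides
flat_copy = a.T.reshape(6)  # flattening the transpose needs a new buffer

shares_view = np.shares_memory(a, view)
shares_transposed = np.shares_memory(a, transposed)
shares_copy = np.shares_memory(a, flat_copy)

print(shares_view, shares_transposed, shares_copy)
```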


### Indexing and Slicing<a name="numpy_indexing"></a>

In [24]:
# indexing
print(x[0,0])

1


In [25]:
# slicing x[start:stop:step] returns a view, not a copy; only the shape and strides are modified
y = x[0:2, 0]
print(y.__array_interface__['data'][0], x.__array_interface__['data'][0])   # same memory address
y[0] = 10
print(x)    # x changes
y = np.array([5, 5])  # y is rebound to a new array, so x does not change
print(x)

27744320 27744320
[[10  2]
 [ 1  2]
 [ 1  2]]
[[10  2]
 [ 1  2]
 [ 1  2]]


In [26]:
# Boolean mask
np.sum(x < 2, axis=0)

array([2, 0])

In [27]:
# fancy indexing
row = np.array([0, 1, 2])
col = np.array([1, 0, 0])
y = x[row, col]
print(y)     # three elements: (0,1), (1,0), (2,0)
# fancy indexing returns a copy
y[2] = 5
print(x)       # x does not change

[2 1 1]
[[10  2]
 [ 1  2]
 [ 1  2]]
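
To summarize the view-versus-copy behavior of the three indexing styles above, here is a short sketch (the array is rebuilt locally so the snippet runs on its own):

```python
import numpy as np

x = np.array([[10, 2], [1, 2], [1, 2]])

sliced = x[0:2, 0]                                  # basic slicing
fancy = x[np.array([0, 1, 2]), np.array([1, 0, 0])] # fancy indexing
masked = x[x < 2]                                   # Boolean mask

slice_is_view = np.shares_memory(x, sliced)
fancy_is_view = np.shares_memory(x, fancy)
mask_is_view = np.shares_memory(x, masked)

print(slice_is_view, fancy_is_view, mask_is_view)
```

Only basic slicing shares memory with `x`; fancy indexing and Boolean masking both allocate new arrays.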


### Computation

In [38]:
M1 = np.array([[1,2,3], [4,5,6], [7,8,9]])
M2 = np.array([[11,12,13], [14,15,16], [17,18,19]])

In [39]:
M1[0, :]           # by row

array([1, 2, 3])

In [40]:
V1 = np.array([1,2,3])
V1.shape

(3,)

In [41]:
V2 = V1.reshape([-1,1])
V2.shape

(3, 1)

In [42]:
M1 * M2       # element-wise

array([[ 11,  24,  39],
       [ 56,  75,  96],
       [119, 144, 171]])

In [43]:
M1 * V1       # V[0] * first col of M, V[1] * second col of M, ...

array([[ 1,  4,  9],
       [ 4, 10, 18],
       [ 7, 16, 27]])

In [44]:
V1 * M1       # same as M1 * V1

array([[ 1,  4,  9],
       [ 4, 10, 18],
       [ 7, 16, 27]])

In [45]:
M1 * V2       # V[0] * first row of M, V[1] * second row of M, ...

array([[ 1,  2,  3],
       [ 8, 10, 12],
       [21, 24, 27]])

In [46]:
V2 * M1       # same as M1 * V2

array([[ 1,  2,  3],
       [ 8, 10, 12],
       [21, 24, 27]])

In [47]:
np.dot(M1, V1)         # (3,3)x(3,) => (3,); V1 is treated as a (3,1) column, but the result is 1-D

array([14, 32, 50])

In [48]:
np.dot(V1, M1)         # (3,)x(3,3) => (3,); V1 is treated as a (1,3) row, but the result is 1-D

array([30, 36, 42])

In [49]:
np.dot(M1, V2)         # (3,3)x(3,1) => (3,1)

array([[14],
       [32],
       [50]])

In [51]:
np.dot(V2, M1)        # (3,1)x(3,3) => error
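
The failing call can be reproduced without stopping the notebook by catching the exception. The sketch below also shows one way to make the shapes align, by transposing `V2` into a (1,3) row:

```python
import numpy as np

M1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
V2 = np.array([1, 2, 3]).reshape(-1, 1)    # shape (3, 1)

# Inner dimensions 1 and 3 do not match, so np.dot raises ValueError.
try:
    np.dot(V2, M1)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# (1,3) x (3,3) => (1,3) works.
row_result = np.dot(V2.T, M1)

print(error_message)
print(row_result)
```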

In [52]:
M1 @ M2         # for 2-D arrays @ is equivalent to np.dot; for pd.DataFrame operands the result is a pd.DataFrame

array([[ 90,  96, 102],
       [216, 231, 246],
       [342, 366, 390]])

In [53]:
M2 @ M1

array([[150, 186, 222],
       [186, 231, 276],
       [222, 276, 330]])

In [54]:
np.dot(M1, M2)

array([[ 90,  96, 102],
       [216, 231, 246],
       [342, 366, 390]])

In [55]:
np.dot(M2, M1)

array([[150, 186, 222],
       [186, 231, 276],
       [222, 276, 330]])

For the DataFrame equivalents of these element-wise operations, see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.div.html
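
As a rough sketch of the DataFrame side, the `axis` argument of methods such as `DataFrame.mul` plays the role that the vector shape plays in NumPy. The names `df` and `v` are illustrative only.

```python
import numpy as np
import pandas as pd

M1 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(M1)
v = pd.Series([1, 2, 3])

# Align v with the column labels: column j is multiplied by v[j] (like M1 * V1).
by_column = df.mul(v, axis="columns")

# Align v with the row labels: row i is multiplied by v[i] (like M1 * V2).
by_row = df.mul(v, axis="index")

print(by_column)
print(by_row)
```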

### ufunc<a name="numpy_ufunc"></a>

A plain Python loop is interpreted and therefore slow. NumPy's universal functions (ufuncs) express the operation on whole arrays, so the loop is vectorized and pushed down into the compiled layer.
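
A minimal sketch of the same computation written both ways (the reciprocal is an arbitrary example operation):

```python
import numpy as np

values = np.arange(1, 6)

# Explicit Python loop: one interpreted iteration per element.
looped = np.empty(len(values))
for i, v in enumerate(values):
    looped[i] = 1.0 / v

# Array expression: the division operator dispatches to a ufunc,
# so the loop over elements runs in compiled code.
vectorized = 1.0 / values

print(looped)
print(vectorized)
print(isinstance(np.true_divide, np.ufunc))
```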

In [28]:
big_array = np.random.rand(1_000_000)
%timeit sum(big_array)
%timeit np.sum(big_array)

10 loops, best of 5: 85.4 ms per loop
1000 loops, best of 5: 382 µs per loop


In [29]:
# broadcasting
y = np.ones((3,1))
x+y     # y is broadcast across the columns of x

array([[11.,  3.],
       [ 2.,  3.],
       [ 2.,  3.]])
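
Broadcasting can be made explicit with `np.broadcast_to`, which shows what `y` is stretched into before the addition. The sketch below also shows a shape pair that cannot be broadcast:

```python
import numpy as np

x = np.array([[10, 2], [1, 2], [1, 2]])   # shape (3, 2)
y = np.ones((3, 1))                       # shape (3, 1)

# The length-1 axis of y is stretched to match x: (3, 1) -> (3, 2).
y_stretched = np.broadcast_to(y, x.shape)
same_result = np.array_equal(x + y, x + y_stretched)

# Shapes (3, 2) and (3,) are incompatible: trailing axes are 2 and 3.
try:
    x + np.ones(3)
    broadcast_error = None
except ValueError as exc:
    broadcast_error = str(exc)

print(y_stretched.shape, same_result)
print(broadcast_error)
```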

Advanced ufunc usage: aggregation

![image](https://github.com/letianzj/QuantResearch/blob/master/notebooks/img/map_filter_reduce.jpg?raw=1)

In [30]:
# map
def square(x):
    return x*x

x = [1, 2, 3, 4, 5]
print(list(map(square, x)))

vfunc = np.vectorize(square)
print(vfunc(x))

[1, 4, 9, 16, 25]
[ 1  4  9 16 25]
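
Note that, per the NumPy documentation, `np.vectorize` is provided for convenience rather than performance; it is essentially a loop. For a simple function like `square`, operating on the array directly gives the same result in compiled code. A short sketch:

```python
import numpy as np

def square(value):
    return value * value

x = np.array([1, 2, 3, 4, 5])

via_vectorize = np.vectorize(square)(x)   # Python-level call per element
via_operator = x * x                      # compiled element-wise multiply
via_ufunc = np.square(x)                  # the dedicated ufunc

print(via_vectorize, via_operator, via_ufunc)
```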


In [31]:
# filter
def check_even(number):
    return number % 2 == 0

even_numbers_iterator = filter(check_even, x)
print(list(even_numbers_iterator))

x = np.array(x)
mask = x % 2 == 0
print(x[mask])

[2, 4]
[2 4]


In [32]:
# reduce
np.add.reduce(x)

15
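
For comparison with the pure-Python counterpart, and to show that every binary ufunc offers the same methods, here is a small sketch using `functools.reduce`, `reduce`, and `accumulate`:

```python
from functools import reduce
import operator

import numpy as np

numbers = [1, 2, 3, 4, 5]
x = np.array(numbers)

py_total = reduce(operator.add, numbers)   # pure-Python reduce
np_total = np.add.reduce(x)                # same reduction in compiled code
running_total = np.add.accumulate(x)       # keeps the intermediate results
product = np.multiply.reduce(x)            # reduce works for other ufuncs too

print(py_total, np_total, running_total, product)
```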

## Pandas <a name="pandas"></a>

A pandas Series is an indexed NumPy array. The index is immutable and ordered (but not necessarily sorted).

A DataFrame consists of multiple column-oriented Series that share an index. Each column has its own dtype.
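
These properties can be checked on a toy example. The data and column names below are made up for illustration:

```python
import pandas as pd

# The index keeps insertion order; it does not have to be sorted.
s = pd.Series([10, 20, 30], index=["b", "a", "c"])
index_order = list(s.index)
is_sorted = s.index.is_monotonic_increasing

# Assigning to an index element is not allowed.
try:
    s.index[0] = "z"
    index_error = None
except TypeError as exc:
    index_error = str(exc)

# Each DataFrame column carries its own dtype.
df = pd.DataFrame({"price": [1.5, 2.5], "size": [100, 200]})

print(index_order, is_sorted)
print(index_error)
print(df.dtypes)
```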

- [Grouping](#pandas_grouping)
- [Apply](#pandas_apply)
- [Merge](#pandas_merge)

In [33]:
np.random.seed(0)
df_orders = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['BS', 'B', 'O', 'OS'])
df_orders['Sym'] = np.random.choice(['AAPL', 'AMZN'], 100, replace=True)
df_orders.index = pd.date_range("2020-01-01 09:30:00", periods=100, freq='ns', tz='US/Eastern')

df_trades = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=['P', 'S'])
df_trades['Sym'] = np.random.choice(['AAPL', 'AMZN'], 100, replace=True)
df_trades.index = pd.date_range("2020-01-01 09:30:00", periods=100, freq='ns', tz='US/Eastern') + pd.Timedelta(2, 'ns')

df_trades.head(5)

| | P | S | Sym |
|---|---|---|---|
| 2020-01-01 09:30:00.000000002-05:00 | 10 | 94 | AAPL |
| 2020-01-01 09:30:00.000000003-05:00 | 91 | 43 | AAPL |
| 2020-01-01 09:30:00.000000004-05:00 | 63 | 31 | AAPL |
| 2020-01-01 09:30:00.000000005-05:00 | 20 | 70 | AMZN |
| 2020-01-01 09:30:00.000000006-05:00 | 9 | 60 | AAPL |


In [34]:
t1 = df_trades.index[0] + pd.Timedelta(1, 'ns')
t2 = t1 + pd.Timedelta(3, 'ns')
df_trades[t1:t2]       # slicing by index label; both endpoints are included

| | P | S | Sym |
|---|---|---|---|
| 2020-01-01 09:30:00.000000003-05:00 | 91 | 43 | AAPL |
| 2020-01-01 09:30:00.000000004-05:00 | 63 | 31 | AAPL |
| 2020-01-01 09:30:00.000000005-05:00 | 20 | 70 | AMZN |
| 2020-01-01 09:30:00.000000006-05:00 | 9 | 60 | AAPL |


In [35]:
df_trades[1:3]         # slicing by row position; the end position is excluded

| | P | S | Sym |
|---|---|---|---|
| 2020-01-01 09:30:00.000000003-05:00 | 91 | 43 | AAPL |
| 2020-01-01 09:30:00.000000004-05:00 | 63 | 31 | AAPL |
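
The two slicing styles above can also be written explicitly with `.loc` (labels) and `.iloc` (positions), which avoids any ambiguity about which one is meant. A sketch on a small hypothetical frame with a daily index:

```python
import pandas as pd

idx = pd.date_range("2020-01-01", periods=5, freq="D")
df = pd.DataFrame({"P": [10, 91, 63, 20, 9]}, index=idx)

# Label-based slice: both endpoints are included.
by_label = df.loc[idx[1]:idx[3]]

# Position-based slice: the end position is excluded.
by_position = df.iloc[1:3]

print(by_label)
print(by_position)
```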


In [36]:
df_trades[(df_trades.S > 20) & (df_trades.S < 25)]     # masking

| | P | S | Sym |
|---|---|---|---|
| 2020-01-01 09:30:00.000000014-05:00 | 39 | 24 | AMZN |
| 2020-01-01 09:30:00.000000034-05:00 | 7 | 21 | AMZN |
| 2020-01-01 09:30:00.000000094-05:00 | 77 | 21 | AAPL |
| 2020-01-01 09:30:00.000000097-05:00 | 25 | 21 | AAPL |
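
The mask is built with the element-wise `&` operator and parenthesized comparisons. The sketch below, on a made-up column, shows the working form and what happens if the Python keyword `and` is used instead:

```python
import pandas as pd

df = pd.DataFrame({"S": [10, 21, 24, 30]})

# Element-wise AND of two Boolean Series; the parentheses are required
# because & binds more tightly than the comparisons.
mask = (df.S > 20) & (df.S < 25)
selected = df[mask]

# `and` asks for the truth value of a whole Series, which is ambiguous.
try:
    df[(df.S > 20) and (df.S < 25)]
    and_error = None
except ValueError as exc:
    and_error = str(exc)

print(selected)
print(and_error)
```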


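### Grouping <a name="pandas_grouping"></a>

`groupby` follows the split/apply/combine pattern: split the rows into groups by key, apply a function to each group, and combine the results.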
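
A minimal sketch of the three steps on a small hypothetical trades table (the values are made up; the column names mirror `df_trades`):

```python
import pandas as pd

df = pd.DataFrame({
    "Sym": ["AAPL", "AMZN", "AAPL", "AMZN"],
    "P": [10, 20, 30, 40],
    "S": [1, 2, 3, 4],
})

# Split and apply by hand: iterate over the groups and sum each one.
manual = {sym: part["S"].sum() for sym, part in df.groupby("Sym")}

# The same thing in one call: split by Sym, apply sum, combine into a Series.
combined = df.groupby("Sym")["S"].sum()

print(manual)
print(combined)
```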

## Reference <a name="reference"></a>

* [Numpy QuickStart](https://numpy.org/doc/stable/user/quickstart.html)

* [Pandas User Guide](https://pandas.pydata.org/docs/user_guide/index.html)

* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)