<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

# Mathematics Basics

**With `NumPy, pandas & PyTables`**

&copy; Dr. Yves J. Hilpisch | The Python Quants GmbH

http://tpq.io | [training@tpq.io](mailto:training@tpq.io) | [@dyjh](http://twitter.com/dyjh)

## Python and Big Data

Python per se is _not_ a Big Data technology. However, Python in combination with packages like `pandas` or `PyTables` allows the management and the analysis of quite large data sets.

For our purposes, we define Big Data as one or more **objects** and/or **data files** that do _not_ fit into the memory of a single computer (server, node, etc.), whatever hardware you are using for data analytics. Typical analytics and computational tasks, like counting, aggregation and selection, shall be implemented on such data.

## Large Scale Computation

Computation = Mathematics + Programming + Data

Large Scale Computation = Mathematics + Programming + Large Data Sets

## Out-of-Memory Analytics with NumPy

Sometimes operations on `NumPy ndarray` objects generate so many temporary objects that the available memory does not suffice to finish the desired operation. An example might be `a.dot(a.T)`, i.e. the dot product of an array `a` with itself transposed.

Such an operation involves **three arrays**: `a`, `a.T` and `a.dot(a.T)`. NumPy returns `a.T` as a view without copying the data, but the result alone is a new `m × m` array. If the array `a` is sufficiently large, say 50% of the free memory, such an operation is not possible purely in memory.

A solution is to work with **disk-based arrays** and to use **memory maps** of these arrays.
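As a minimal sketch of the idea, the following creates a small disk-based array, closes it and re-opens it as a read-only memory map. The file name and the temporary directory are assumptions made for illustration only.

```python
import os
import tempfile

import numpy as np

# hypothetical file location in a temporary directory
tmp_dir = tempfile.mkdtemp()
fname = os.path.join(tmp_dir, 'demo.npy')

# create a small disk-based array in .npy format and fill it
mm = np.lib.format.open_memmap(fname, mode='w+', dtype=np.float64,
                               shape=(4, 3))
mm[:] = np.arange(12).reshape(4, 3)
mm.flush()  # push pending changes to disk
del mm

# re-open read-only; data is paged in from disk only when accessed
ro = np.lib.format.open_memmap(fname, mode='r')
total = float(ro.sum())
owns = ro.flags.owndata  # the memory map does not own its data
```

The same pattern scales to arrays that are larger than the free memory, since only the accessed parts of the file are loaded.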

Some **imports** first and a check of the **free memory**.

In [None]:
!git clone https://github.com/tpq-classes/mathematics_basics.git
import sys
sys.path.append('mathematics_basics')


In [None]:
import psutil
import numpy as np
import pandas as pd
np.set_printoptions(suppress=True)
pd.set_option('display.float_format', lambda x: '%.4f' % x)

In [None]:
psutil.virtual_memory()

In [None]:
print('RAM % used:', psutil.virtual_memory().percent)

### Sample Data

We generate a larger `NumPy ndarray` object.

In [None]:
from numpy.random import default_rng

In [None]:
rng = default_rng(100)

In [None]:
m = 10000
n = 10000

In [None]:
%%time
a = rng.standard_normal((m, n))

In [None]:
a.nbytes

Checking **memory** again &ndash; and that the object (reference pointer) indeed **owns the data**.

In [None]:
psutil.virtual_memory()

In [None]:
a.flags.owndata
  # the object owns the in-memory data

Simple **operations** on the in-memory `ndarray` object.

In [None]:
a[:3, :3]
  # sample data

In [None]:
%time a.mean()
  # reductions work

Now save this object **to disk** ...

In [None]:
# path = '/Users/yves/Temp/data/' 
path = '../../../data/'  # needs to be adjusted

In [None]:
%time np.save(path + 'od', a)
  # save memory array to disk (SSD)
  # (can need less time than in-memory generation)

... and **delete** the in-memory object.

In [None]:
del a
  # delete the in-memory version
  # to free memory -- somehow ...
  # gc does not work "instantly"

In [None]:
psutil.virtual_memory()
  # garbage collection does not bring that much ...
  # memory usage has not changed significantly

### Memory Map of Data

Using the saved object, we generate a new `memmap` object.

In [None]:
od = np.lib.format.open_memmap(path + 'od.npy', dtype=np.float64, mode='r')
  # open memmap array with the array file as data

In [None]:
od.flags.owndata
  # object does not own the data

It behaves largely the **same way** as an in-memory `ndarray` object.

In [None]:
od[:3, :3]
  # compare sample data

In [None]:
%time od.mean()
  # operations in NumPy as usual
  # somewhat slower of course ...

### Memory Maps of (Intermediate) Results

Major memory problems with `NumPy ndarray` objects generally arise due to **temporary arrays** needed to store intermediate results. We therefore generate `memmap` objects to store intermediate and final results.

First, for the **transpose of the array**.

In [None]:
tr = np.memmap(path + 'tr.npy', dtype=np.float64, mode='w+', shape=(n, m))
  # memmap object for transpose

In [None]:
%time tr[:] = od.T[:]
  # write transpose to disk

In [None]:
!ls -n $path

Second, for the **final results**.

In [None]:
re = np.memmap(path + 're.npy', dtype=np.float64, mode='w+', shape=(m, m))
  # memmap object for result

In [None]:
%time re[:] = od.dot(tr)[:]
  # store results on disk
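
Note that `od.dot(tr)` above first builds the complete result as a regular in-memory array before it is copied into `re`. A possible refinement, sketched here under the assumption that a block of rows fits comfortably into memory, computes the product block by block and writes each block straight into the memory map. The helper `blockwise_gram`, the block size and the file name are hypothetical.

```python
import os
import tempfile

import numpy as np


def blockwise_gram(a, out, block=256):
    """Write a @ a.T into `out` one block of rows at a time."""
    for start in range(0, a.shape[0], block):
        stop = min(start + block, a.shape[0])
        # only a (block x m) slice of the result is held in memory
        out[start:stop] = a[start:stop] @ a.T
    return out


rng = np.random.default_rng(100)
a_small = rng.standard_normal((500, 40))  # small stand-in for `od`

# hypothetical result file in a temporary directory
res_file = os.path.join(tempfile.mkdtemp(), 're_demo.dat')
res = np.memmap(res_file, dtype=np.float64, mode='w+', shape=(500, 500))
blockwise_gram(a_small, res, block=128)
res.flush()
```

The block size trades memory for the number of passes: smaller blocks need less RAM but issue more, smaller matrix products.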

### Final Look and Cleaning Up

Lots of data (`od + tr + re`) has been crunched/created without a real memory burden.

In [None]:
psutil.virtual_memory()

In [None]:
!ls -n $path

In [None]:
!rm $path*

### Using a Sub-Process

The `concurrent.futures` module allows the use of a **separate sub-process** for callables.

In [None]:
import concurrent.futures

In [None]:
def generate_array_on_disk(m, n):
    # memory inefficient operation
    a = rng.standard_normal((m, n))
    np.save(path + 'od.npy', a)

Using such a sub-process ensures that any memory it uses is freed as soon as the sub-process terminates. This leaves the **free memory of the current process** mainly unchanged and avoids the "unpredictable" behaviour of `Python` garbage collection.

In [None]:
psutil.virtual_memory()

In [None]:
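%%time
# ProcessPoolExecutor (not ThreadPoolExecutor) runs the callable in a sub-process
# note: a notebook-defined callable needs the 'fork' start method (e.g. Linux)
with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
    executor.submit(generate_array_on_disk, m, n).result()
  # separate sub-process is started, the callable is executed
  # the process with all its memory usage is shut down afterwards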

In [None]:
psutil.virtual_memory()
  # meanwhile memory was freed again
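
An alternative that does not depend on how `concurrent.futures` starts its workers is to launch a fresh interpreter with the standard-library `subprocess` module. The following is only a sketch: the helper `generate_in_child`, the embedded script and the file name are assumptions for illustration.

```python
import os
import subprocess
import sys
import tempfile

import numpy as np

# code executed by the child interpreter; its memory is returned to the
# operating system as soon as the child process exits
child_code = """
import sys
import numpy as np
m, n, fname = int(sys.argv[1]), int(sys.argv[2]), sys.argv[3]
rng = np.random.default_rng(100)
np.save(fname, rng.standard_normal((m, n)))
"""


def generate_in_child(m, n, fname):
    """Run the array generation in a separate Python process."""
    subprocess.run([sys.executable, '-c', child_code, str(m), str(n), fname],
                   check=True)


# hypothetical target file in a temporary directory
demo_file = os.path.join(tempfile.mkdtemp(), 'od_demo.npy')
generate_in_child(200, 100, demo_file)

# the parent only maps the file; it never held the array in memory
od_demo = np.load(demo_file, mmap_mode='r')
```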

Final look and clean-up.

In [None]:
!ls -n $path

In [None]:
!rm $path*

## Processing (Too) Large CSV Files

We generate a CSV file on disk that is **too large** to fit into memory. We process this file with the help of `pandas` and `PyTables`.

First, some imports. 

In [None]:
import os
import numpy as np
import pandas as pd
import datetime as dt

### Generating an Example CSV File

Number of **rows** to be generated for random data set.

In [None]:
N = int(1e5)
N

Using both random **integers** as well as **floats**.

In [None]:
ran_int = rng.integers(0, 10000, size=(2, N))
ran_flo = rng.standard_normal((2, N))

Filename for **`csv` file**.

In [None]:
csv_name = path + 'data.csv'
csv_name

**Writing the data** row by row.

In [None]:
%%time
with open(csv_name, 'w') as csv_file:
    header = 'date,int1,int2,flo1,flo2\n'
    csv_file.write(header)
    for _ in range(20):
        # 20 times the original data set
        for i in range(N):
            row = '%s,%i,%i,%f,%f\n' % \
                    (dt.datetime.now(), ran_int[0, i], ran_int[1, i],
                                    ran_flo[0, i], ran_flo[1, i])
            csv_file.write(row)
        print('Size on disk:', os.path.getsize(csv_name))

**Excursion**: If only the numerical data is to be written to disk, using `np.savetxt()` can be more efficient.

In [None]:
ran = np.vstack((ran_int, ran_flo)).T

In [None]:
ran[:3]

In [None]:
csv_name_ = path + 'data_.csv'
csv_name_

In [None]:
%%time
np.savetxt(csv_name_, ran, delimiter=',')  # just a single data set (not 20)
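
Because `np.vstack()` upcasts the integer columns to floats, the file written above contains all values in floating-point notation. A small sketch of how per-column formats and a header line could be passed to `np.savetxt()` follows; the sample size and the file name are assumptions.

```python
import os
import tempfile

import numpy as np

rng = np.random.default_rng(100)
ints = rng.integers(0, 10000, size=(2, 5))
flos = rng.standard_normal((2, 5))
data = np.vstack((ints, flos)).T  # integers are upcast to floats here

# hypothetical file name in a temporary directory
fname = os.path.join(tempfile.mkdtemp(), 'demo.csv')
np.savetxt(fname, data, delimiter=',',
           fmt=['%d', '%d', '%f', '%f'],  # one format per column
           header='int1,int2,flo1,flo2',
           comments='')  # write the header without a leading '# '

with open(fname) as f:
    first_line = f.readline().strip()
    second_line = f.readline().strip()
```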

**Delete** the original `NumPy ndarray` objects.

In [None]:
del ran
del ran_int
del ran_flo

**Reading some rows** to check the content.

In [None]:
with open(csv_name, 'r') as csv_file:
    for _ in range(5):
        print(csv_file.readline(), end='')

In [None]:
#with open(csv_name_, 'r') as csv_file:
#    for _ in range(5):
#        print(csv_file.readline(), end='')

### Reading and Writing with pandas

The filename for the `pandas HDFStore`.

In [None]:
!ls -n $path

In [None]:
pd_name = path + 'data.h5p'

In [None]:
h5 = pd.HDFStore(pd_name, 'w')

`pandas` allows reading data from (large) files chunk-wise via a **file-iterator**.

In [None]:
it = pd.read_csv(csv_name, iterator=True, chunksize=N // 20)  # integer chunk size
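
The same file-iterator can also be used to aggregate directly while reading, so that only one chunk is in memory at any time. The following sketch uses a tiny in-memory CSV as a stand-in for a large file; the column names and the chunk size are assumptions.

```python
import io

import pandas as pd

# small in-memory stand-in for a large CSV file
csv_text = 'int1,flo1\n' + ''.join(f'{i},{i / 2}\n' for i in range(10))

n_rows = 0
flo_sum = 0.0
# chunksize is the number of rows per chunk; each chunk is a DataFrame
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    n_rows += len(chunk)
    flo_sum += chunk['flo1'].sum()

flo_mean = flo_sum / n_rows  # mean computed from running totals
```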

Reading and storing the data **chunk-wise**.

In [None]:
%%time
for i, chunk in enumerate(it):
    h5.append('data', chunk)
    if i % 20 == 0:
        print('Size on disk:', os.path.getsize(pd_name))

The resulting `HDF5` file.

In [None]:
print(h5.info())

### Disk-Based Analytics with pandas

The **disk-based** `pandas DataFrame` mainly behaves like an **in-memory** object &ndash; but these operations are not memory efficient.

In [None]:
%time h5['data'].describe()

**Data selection and plotting** work as with regular `pandas DataFrame` objects &ndash; again not really memory efficient.

In [None]:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8')
%config InlineBackend.figure_format = 'svg'

In [None]:
%time h5['data']['flo2'][0:N:1000].cumsum().plot();

In [None]:
h5.close()

The major reason for this memory inefficiency is that the `DataFrame` **data structure is broken up** (e.g. into columns) during storage. For analytics it has to be put together in-memory again.

In [None]:
import tables as tb

In [None]:
h5 = tb.open_file(path + 'data.h5p', 'r')

In [None]:
h5

In [None]:
h5.close()

### Reading with pandas and Writing with PyTables

The `PyTables` database file.

In [None]:
import tables as tb

In [None]:
tb_name = path + 'data.h5t'

In [None]:
h5 = tb.open_file(tb_name, 'w')

Using a **`rec array` object** of `NumPy` to provide the row description for the `PyTables` table. To this end, a **custom `dtype` object** is needed.

In [None]:
dty = np.dtype([('date', 'S26'), ('int1', '<i8'), ('int2', '<i8'),
                                 ('flo1', '<f8'), ('flo2', '<f8')])
  # change dtype for date from object to string
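
To illustrate what the custom `dtype` does, the following sketch converts a small `DataFrame` into a structured array with fixed-width records. The sample values are made up and the column-by-column fill is just one possible way to do the conversion.

```python
import numpy as np
import pandas as pd

dty = np.dtype([('date', 'S26'), ('int1', '<i8'), ('int2', '<i8'),
                ('flo1', '<f8'), ('flo2', '<f8')])

# small stand-in for one chunk read from the CSV file
chunk = pd.DataFrame({
    'date': ['2021-01-01 10:00:00.000001', '2021-01-01 10:00:00.000002'],
    'int1': [1, 2], 'int2': [3, 4],
    'flo1': [0.5, 1.5], 'flo2': [2.5, 3.5]})

# fill a structured array column by column;
# the date strings are stored as fixed-width byte strings
rec = np.empty(len(chunk), dtype=dty)
for name in dty.names:
    rec[name] = chunk[name].tolist()
```

Every record then occupies the same number of bytes (`rec.itemsize`), which is what allows a table to be stored and sliced row-wise on disk.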

Adding **compression** to the mix (less storage, better backups, better data transfer, etc.).

In [None]:
filters = tb.Filters(complevel=2, complib='blosc')

Again **reading and writing chunk-wise**, this time appending to a `PyTables table` object.

In [None]:
it = pd.read_csv(csv_name, iterator=True, chunksize=N // 20)  # integer chunk size

In [None]:
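%%time
tab = h5.create_table('/', 'data',
            np.array(it.get_chunk().to_records(index=False),
                     dty), filters=filters)
  # initialize table object by using first chunk and adjusted dtype
  # (it.read() would load the complete file into memory instead)
for chunk in it:
    # cast each remaining chunk to the table dtype before appending
    tab.append(np.array(chunk.to_records(index=False), dty))
tab.flush()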

The resulting `table` object.

In [None]:
h5.get_filesize()

In [None]:
tab

### Out-of-Memory Analytics with PyTables

**Data on disk** can be used as if it were both _in-memory_ and _uncompressed_. De-compression is done at run-time.

In [None]:
tab[N:N + 3]
  # slicing row-wise

In [None]:
tab[N:N + 3]['date']
  # access selected data points

**Counting** of rows is easily accomplished (although here not really needed).

In [None]:
%time len(tab[:]['flo1'])
  # length of column (object)

**Aggregation** operations, like summing up or calculating the mean value, are another application area.

In [None]:
%time tab[:]['flo1'].sum()
  # sum over column

In [None]:
%time tab[:]['flo1'].mean()
  # mean over column

Typical, `SQL`-like, **conditions and queries** can be added.

In [None]:
%time sum([row['flo2'] for row in tab.where('(flo1 > 3) & (int2 < 1000)')])
  # sum combined with condition
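
For comparison, the logic of such a conditional sum can be sketched in plain `NumPy` with boolean masks that are evaluated chunk by chunk. This is only an analogue of the query above, not how `PyTables` implements it; the helper `conditional_sum`, the sample data and the relaxed thresholds (chosen so that the small sample contains matches) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(100)
n_rows = 10_000
data = np.empty(n_rows, dtype=[('int2', '<i8'), ('flo1', '<f8'),
                               ('flo2', '<f8')])
data['int2'] = rng.integers(0, 10000, n_rows)
data['flo1'] = rng.standard_normal(n_rows)
data['flo2'] = rng.standard_normal(n_rows)


def conditional_sum(table, chunk_rows=1_000):
    """Sum flo2 over rows with flo1 > 1 and int2 < 5000, chunk by chunk."""
    total = 0.0
    for start in range(0, len(table), chunk_rows):
        chunk = table[start:start + chunk_rows]  # one slice at a time
        mask = (chunk['flo1'] > 1) & (chunk['int2'] < 5000)
        total += chunk['flo2'][mask].sum()
    return total


result = conditional_sum(data)
```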

In [None]:
h5.close()

### Overview

All operations have been on data sets that do not fit (if uncompressed) into the memory of the machine they have been implemented on.

In [None]:
!ls -n $path

Using compression of course reduces the size of the `PyTables table` object relative to the `csv` and the `pandas HDFStore` files. This might, in certain circumstances, lead to file sizes that would again fit in memory.

In [None]:
!rm $path*

## Conclusions

`Python` and packages like **`NumPy, pandas, PyTables`** provide useful means and approaches to circumvent the limitations of free memory on a single computer (node, server, etc.).

Key to the **performance of such out-of-memory operations** are mainly the storage hardware (speed/capacity), the data format used (e.g. `HDF5` vs. relational databases) and in some scenarios also the use of performant compression algorithms.

The read/write speed of **`SSD` hardware** is evolving fast:

* status quo: **3+GB/s** reading/writing (e.g. MacBook 2020)
* available: **6+GB/s** reading/writing (e.g. latest SSDs 2021)
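
To put these figures into perspective, a back-of-the-envelope calculation for the 10,000 x 10,000 `float64` array used earlier might look as follows. It assumes an idealized sustained transfer rate with 1 GB = 1e9 bytes and ignores file system and caching effects; the helper `transfer_seconds` is hypothetical.

```python
def transfer_seconds(n_bytes, gb_per_second):
    """Idealized time to move n_bytes at a sustained rate (1 GB = 1e9 bytes)."""
    return n_bytes / (gb_per_second * 1e9)


array_bytes = 10_000 * 10_000 * 8  # float64 array with m = n = 10000

for rate in (3, 6):
    print(f'{rate} GB/s: {transfer_seconds(array_bytes, rate):.3f} s')
```

Under these assumptions the 800 MB array takes roughly 0.27 seconds at 3 GB/s and roughly 0.13 seconds at 6 GB/s.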

Check out [Fastest SSD Drives](https://www.gamingpcbuilder.com/ssd-ranking-the-fastest-solid-state-drives/).


<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:training@tpq.io">training@tpq.io</a>