In [3]:
# Boilerplate
%matplotlib inline

# Intel DAAL related imports
from daal.data_management import HomogenNumericTable
import sys, os

sys.path.append(os.path.realpath('../3-custom-modules'))
from customUtils import getArrayFromNT

# Import numpy and matplotlib
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Plotting configurations
%config InlineBackend.figure_format = 'retina'
plt.rcParams["figure.figsize"] = (12, 9)

# Data Management in PyDAAL

### Tutorial brief
As a high-performance data analytics library for Python, PyDAAL has a set of data structures specifically designed to be performance oriented, while still versatile enough to accommodate data of different memory layouts. These data structures are centered around `NumericTable`, a generic data type for representing data in memory. In this section, we first learn the general concept of `NumericTable`. We then focus on the two most important variants of `NumericTable`: `HomogenNumericTable` for homogeneous dense data, and `CSRNumericTable` for sparse data.

It is critical for PyDAAL to work seamlessly with other mathematical and statistical Python packages, such as NumPy, SciPy, Pandas, and scikit-learn. These packages are widely used in the mainstream Python data analytics community, and the goal of PyDAAL is to provide high-performance alternatives to some of the algorithms that they offer. In this section we illustrate, using several simple examples, how PyDAAL can work with the data types in these packages.

### Learning objectives
* To learn about `NumericTable`, the central concept and main data type for data management in PyDAAL.
* To get familiar with the `HomogenNumericTable` and the `CSRNumericTable` API.
* To see how `NumericTables` interact with data types in NumPy, SciPy, Pandas, etc.

### NumericTables
A common conceptual model of data in data analytics is a 2-dimensional structure in which each row is an _observation_ (_sample_) and each column is a _feature_ (_variable_).

![](https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_prog_guide/GUID-65FAD60A-A92A-460F-B43D-4F8C2C39F662-low.png "Dataset")

`NumericTables` in DAAL are modeled after this concept. Every algorithm in DAAL takes `NumericTables` as input and produces `NumericTables` as output. There are several kinds of `NumericTables`, for example,
* **`HomogenNumericTable`** - This is a type for storing dense data where all features are of the same type. Supported types include `int`, `float32`, and `float64`. A `HomogenNumericTable` has the C-contiguous memory layout, that is, rows are laid out contiguously in memory. It is essentially the same as a 2D matrix.

* **`CSRNumericTable`** - This is a type for storing sparse data where all features are of the same type. It is equivalent to a CSR (compressed sparse row) matrix. CSR is one of the most widely used storage formats for sparse matrices. `CSRNumericTable` in PyDAAL is compatible with `scipy.sparse.csr_matrix`.
![](https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_prog_guide/GUID-B89DE139-3E29-41DA-AB45-BB0B655716C3-low.png "CSR 0-based indexing")

![](https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_prog_guide/GUID-F488A72A-68BB-4E64-9D46-9C5FFAD0D431-low.png "CSR 1-based indexing")

* **`AOSNumericTable`** - This table represents heterogeneous data, that is, features (columns) in the table can be of different data types. It uses the row-major memory layout: rows are stored in contiguous memory blocks.
![](https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_prog_guide/GUID-F0B9F856-5C57-4AE0-972E-8E0B70F3BDA4-low.png "AOSNumericTable")

* **`SOANumericTable`** - Another type of table for heterogeneous data, but this one uses the column-major memory layout: columns are stored in contiguous memory blocks.
![](https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_prog_guide/GUID-02052873-BCB8-44CD-A506-7270567D79F7-low.png "SOANumericTable")
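
The difference between the two heterogeneous layouts can be illustrated with plain NumPy. This is only an analogy and not the DAAL API: a structured array stands in for the array-of-structures (row-major) layout, and a dictionary of per-column arrays stands in for the structure-of-arrays (column-major) layout. The field names `age` and `height` and the values are made up for the illustration.

```python
import numpy as np

# Array of structures (AOS): each record (row) is stored contiguously.
aos = np.array(
    [(25, 1.80), (31, 1.65), (47, 1.72)],
    dtype=[("age", np.int32), ("height", np.float64)],
)

# Structure of arrays (SOA): each feature (column) is stored contiguously.
soa = {
    "age": np.array([25, 31, 47], dtype=np.int32),
    "height": np.array([1.80, 1.65, 1.72], dtype=np.float64),
}

# Both hold the same observations; only the memory layout differs.
print(aos[1])             # second observation, read as one record
print(soa["height"][1])   # the same observation's height, read from a column

# In the AOS layout a single column is a strided view over the records,
# whereas in the SOA layout every column is its own contiguous buffer.
print(aos["height"].flags["C_CONTIGUOUS"], soa["height"].flags["C_CONTIGUOUS"])
```

Reading whole observations favors the first layout, while algorithms that scan one feature at a time favor the second.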

Having covered the concepts, we now put `NumericTables` into action. In particular, we want to learn how they interact with the data types of other Python numeric packages. The following examples use `HomogenNumericTable` or `CSRNumericTable`, but the principles carry over to other types of `NumericTable`.

### Interoperability with NumPy ndarrays
The NumPy ndarray is the common denominator in many numeric packages. SciPy, Pandas, scikit-learn, and plotting tools such as matplotlib can either work directly with ndarrays, or have data types built on top of ndarrays. The code below shows how to easily convert an ndarray to a `HomogenNumericTable`. It is worth stressing that:

<p style="color:red"><strong>This works only if the ndarray is C-contiguous</strong></p>

In [None]:
import numpy as np
from daal.data_management import HomogenNumericTable

# The reshape is necessary because the HomogenNumericTable constructor only takes an array with fully defined dimensions (both rows and columns).
x = np.array([1., 2., 3., 4., 5., 6.]).reshape(1, 6)
x_nt = HomogenNumericTable(x)
print(x_nt.getNumberOfRows(), x_nt.getNumberOfColumns())

y_nt = HomogenNumericTable(x.reshape(6, 1))
print(y_nt.getNumberOfRows(), y_nt.getNumberOfColumns())

z_nt = HomogenNumericTable(x.reshape(2, 3))
print(z_nt.getNumberOfRows(), z_nt.getNumberOfColumns())

s = x.reshape(2, 3)
s_slice = s[:, :-1]
print(s_slice.flags['C'])

# DON'T DO THIS. s_slice is not C-contiguous!
# bad_nt = HomogenNumericTable(s_slice)
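
One hedged way to guard against the problem above is to normalize every array before it reaches the table constructor. The sketch below uses only NumPy; the helper name `to_c_contiguous` is hypothetical and not part of PyDAAL or `customUtils`.

```python
import numpy as np

def to_c_contiguous(a, dtype=np.float64):
    """Return a 2D, C-contiguous array with the requested dtype.

    np.ascontiguousarray makes a compact copy when the input is a strided
    view or has a different dtype.
    """
    a = np.ascontiguousarray(a, dtype=dtype)
    if a.ndim != 2:
        raise ValueError("expected a 2D array, got %d dimension(s)" % a.ndim)
    return a

s = np.arange(6, dtype=np.float64).reshape(2, 3)
s_slice = s[:, :-1]                 # a strided view, not C-contiguous
safe = to_c_contiguous(s_slice)     # compact copy with the same values

print(s_slice.flags["C_CONTIGUOUS"], safe.flags["C_CONTIGUOUS"])
```

The array `safe` can then be passed to `HomogenNumericTable` in place of `s_slice`.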

Going from a `HomogenNumericTable` to an ndarray is also possible, as shown below. The operation is so common that we've defined a function `getArrayFromNT` in [customUtils](../3-custom-modules/customUtils/__init__.py) based on the same logic. You can use this function for the rest of the lab.

In [None]:
from daal.data_management import BlockDescriptor_Float64, readOnly

bd = BlockDescriptor_Float64()
z_nt.getBlockOfRows(0, z_nt.getNumberOfRows(), readOnly, bd)
z = bd.getArray()
z_nt.releaseBlockOfRows(bd)
print(z)

### Example: Load data from a file
We often need to get data from a file, typically a CSV file. PyDAAL provides data source connectors that can read data from a CSV file. However, more often than not, NumPy's `genfromtxt` function works just as well.

The example below reads the first 5 rows from a data file, and excludes the first column (column index 0).

In [None]:
data = np.genfromtxt('./mldata/wine.data', dtype=np.double, delimiter=',', usecols=list(range(1, 14)), max_rows=5)
print(data.flags['C'])
data_nt = HomogenNumericTable(data)
print(data_nt.getNumberOfRows(), data_nt.getNumberOfColumns())
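
To see what `usecols` and `max_rows` do without needing the data file, the same call can be tried on an in-memory CSV. This is a small sketch; the CSV content below is made up for the illustration and is not taken from `wine.data`.

```python
import io
import numpy as np

# Hypothetical CSV content: a label column followed by three features.
csv_text = """1,14.2,1.7,2.4
1,13.2,1.8,2.1
2,13.1,2.4,2.7
2,14.4,1.9,2.5
3,13.2,2.6,2.9
3,14.1,1.6,2.2
"""

data = np.genfromtxt(
    io.StringIO(csv_text),
    dtype=np.double,
    delimiter=",",
    usecols=[1, 2, 3],   # skip column 0 (the label)
    max_rows=5,          # read only the first 5 rows
)
print(data.shape)
print(data.flags["C_CONTIGUOUS"])
```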

### Example: Pandas DataFrames
Pandas DataFrames can be converted to ndarrays, and then to `NumericTables`. We can also go in the other direction through ndarrays, as the example below shows. It uses `getArrayFromNT`, imported from [customUtils](../3-custom-modules/customUtils/__init__.py).

In [None]:
import pandas as pd
sys.path.append(os.path.realpath('../3-custom-modules'))
from customUtils import getArrayFromNT

df = pd.DataFrame(np.random.randn(10, 5), columns = ['a', 'b', 'c', 'd', 'e'])
array = df.values
print(array.flags['C'])
print(array.shape)

array_nt = HomogenNumericTable(array)
print(array_nt.getNumberOfRows(), array_nt.getNumberOfColumns())

d = getArrayFromNT(array_nt)
df2 = pd.DataFrame(d, columns = ['a', 'b', 'c', 'd', 'e'])
print(df2)

### Example: scikit-learn datasets
Scikit-learn has functions to load some popular datasets. These datasets are available through [sklearn.datasets](http://scikit-learn.org/stable/datasets). For example, the [load_digits](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits) function loads and returns the digits dataset. Because the dataset stores its data in NumPy ndarrays, we can convert it to DAAL `NumericTables`, and pass them to DAAL algorithms.

Extreme caution must be taken, however, because sometimes the loaded data is not C-contiguous. We need to convert it to the C-contiguous layout before constructing a `NumericTable` from the data. The code below shows how it works.

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.flags['C'])
# digits.data is NOT C-contiguous. We need to make it into the C-contiguous memory layout.
data = np.ascontiguousarray(digits.data, dtype = np.double)
data_nt = HomogenNumericTable(data[-100:])
print(data_nt.getNumberOfRows(), data_nt.getNumberOfColumns())

### Example: SciPy sparse matrix
The last example illustrates `CSRNumericTable`, which is essentially a sparse matrix in the CSR storage format. The CSR format uses three 1D arrays to represent a sparse matrix:
* `values` - All non-zero values are lumped into a dense array.
* `col_ind` - An array of column indices for non-zero values.
* `row_offset` - An array whose $i$-th element is the index in the `values` array of the first non-zero element of the $i$-th row of the matrix. The last element of this array equals _nnz_, the number of non-zeros.
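
The three arrays can be derived by hand from a small dense matrix. The sketch below uses only NumPy and 0-based indices; the helper `dense_to_csr` is hypothetical and written for illustration, not a PyDAAL or SciPy function.

```python
import numpy as np

def dense_to_csr(dense):
    """Build the three 0-based CSR arrays from a dense 2D array."""
    values, col_ind, row_offset = [], [], [0]
    for row in dense:
        cols = np.nonzero(row)[0]          # column indices of non-zeros in this row
        values.extend(row[cols])
        col_ind.extend(cols)
        row_offset.append(len(values))     # index where the next row starts
    return (np.array(values, dtype=np.float64),
            np.array(col_ind, dtype=np.int64),
            np.array(row_offset, dtype=np.int64))

dense = np.array([
    [2.0, 0.0, 6.4, 0.0, 0.0, 1.7, 0.0],
    [0.0, 0.0, 0.0, 3.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # an empty row repeats the previous offset
    [0.0, 2.2, 0.0, 0.0, 2.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 3.8, 5.5],
])

values, col_ind, row_offset = dense_to_csr(dense)
print(values)
print(col_ind)
print(row_offset)
```

Note that `row_offset` has one more entry than the matrix has rows, and that its last entry is the number of non-zeros.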

`CSRNumericTable` can be converted to and from [`scipy.sparse.csr_matrix`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix). However, SciPy uses 0-based indexing while DAAL uses 1-based indexing. Hence, the values of the two index arrays need to be incremented before giving them to DAAL. The code below shows how to convert from a SciPy sparse matrix to a `CSRNumericTable`. One peculiar thing to note when constructing a `CSRNumericTable` is that the index arrays (`col_ind` and `row_offset`) must be 64-bit integers.

In [None]:
from scipy.sparse import csr_matrix
from daal.data_management import CSRNumericTable

# First, create a sparse matrix
values = np.array([2.0, 6.4, 1.7, 3.1, 2.2, 2.1, 3.8, 5.5])
col_ind = np.array([0, 2, 5, 3, 1, 4, 5, 6])
row_offset = np.array([0, 3, 4, 4, 6, 8])
sp = csr_matrix((values, col_ind, row_offset), dtype=np.double, shape=(5, 7))
print(sp.toarray())

# Then, create a CSRNumericTable based on the sparse matrix
sp_nt = CSRNumericTable(sp.data, sp.indices.astype(np.uint64) + 1, sp.indptr.astype(np.uint64) + 1, 7, 5)
print(sp_nt.getNumberOfRows(), sp_nt.getNumberOfColumns())
(values, col_ind, row_offset) = sp_nt.getArrays()
print(getArrayFromNT(sp_nt))
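
The index shift is easy to get wrong, so it can help to check it independently of DAAL. The following NumPy-only sketch shifts the index arrays to 1-based unsigned 64-bit integers and confirms that both conventions describe the same matrix. The helpers `to_one_based` and `csr_to_dense` are hypothetical names introduced for this illustration.

```python
import numpy as np

def to_one_based(col_ind, row_offset):
    """Shift 0-based CSR index arrays to 1-based, as unsigned 64-bit integers."""
    one = np.uint64(1)
    return col_ind.astype(np.uint64) + one, row_offset.astype(np.uint64) + one

def csr_to_dense(values, col_ind, row_offset, shape, base=0):
    """Rebuild a dense array from CSR arrays that use the given index base."""
    dense = np.zeros(shape, dtype=values.dtype)
    cols = col_ind.astype(np.int64) - base
    offsets = row_offset.astype(np.int64) - base
    for i in range(shape[0]):
        start, stop = offsets[i], offsets[i + 1]
        dense[i, cols[start:stop]] = values[start:stop]
    return dense

values = np.array([2.0, 6.4, 1.7, 3.1, 2.2, 2.1, 3.8, 5.5])
col_ind = np.array([0, 2, 5, 3, 1, 4, 5, 6])
row_offset = np.array([0, 3, 4, 4, 6, 8])

col_1, off_1 = to_one_based(col_ind, row_offset)
same = np.array_equal(
    csr_to_dense(values, col_ind, row_offset, (5, 7), base=0),
    csr_to_dense(values, col_1, off_1, (5, 7), base=1),
)
print(col_1.dtype, off_1.dtype, same)
```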

### Summary
We learned the central concept of data management in PyDAAL: `NumericTables`. We got a glimpse of four types of `NumericTables` supported in DAAL. We practiced basic operations on `HomogenNumericTable` and `CSRNumericTable`, and their interoperability with NumPy, SciPy, Pandas, and scikit-learn.