# Datalabframework

Minimal example and directory structure.

## Elements

This scaffolding works with three elements, which are co-ordinated with each other:

  - the python notebook you are reading (main.ipynb)
  - the datalabframework python package (datalabframework)
  - configuration files (metadata.yml, \__main__.py, Makefile)

## Principles

- **Both notebooks and code are first-class citizens**

In the source directory `src` you will find all source code. Both notebooks and code files are treated as source files. Source code is further partitioned and scaffolded in several directories to simplify and organize the data science project. Following Python package conventions, the root of the project is tagged by a `__main__.py` file, and each directory contains an `__init__.py` file. By doing so, Python and notebook files can reference each other.

Python notebooks and Python code can be mixed and matched, and are interoperable with each other. You can import functions from a notebook into Python code, and you can import Python files into a notebook.

- **Data directories should not contain logic code**

Data can be located anywhere: on remote HDFS clusters, on object store services exposed via the S3 protocol, and so on. However, it is in general good practice to keep some of the data (or all of it, if possible) locally on the file system.

Data, configuration and code are separated by moving all configuration to metadata files. Metadata files make it possible to define aliases for data resources, data services and Spark configurations, which keeps the Spark code tidy and free of hardcoded parameters.

- **Decouple code from configuration**

Notebooks and code should be decoupled from both engine configurations and data locations. All configuration is kept in `metadata.yml` YAML files. Multiple setups for test, exploration and production can be described in the same `metadata.yml` file or in several separate files.

- **Declarative configuration**

Metadata files are responsible for binding data and engine configurations to the code. For instance, all data in the code should be referenced by an alias, and storage and retrieval of data objects and files should happen via a common API. The metadata YAML file describes the providers for each data source as well as the mapping of data aliases to their corresponding data objects.



## Project Template

The data science project is structured in a way to facilitate the deployment of the artifacts, and to switch from batch processing to live experimentation. The top level project is composed of the following items:

### Top level Structure

```
├── data
│   └── ascombe.csv
├── main.ipynb
├── __main__.py
├── Makefile
└── metadata.yml
```

## Data Lab Framework

In [1]:
import datalabframework as dlf

### Package things
The package version is exposed through the package variables `version_info` and `__version__`.

In [2]:
dlf.version_info

(0, 5, 4)

In [3]:
dlf.__version__

'0.5.4'

Check if the datalabframework is loaded in the current Python context.

In [4]:
try:
    __DATALABFRAMEWORK__
    print("the datalabframework is loaded")
except NameError:
    print("the datalabframework is not loaded")

the datalabframework is loaded


In [5]:
#list of modules loaded as `from datalabframework import * ` 
dlf.__all__

['logging', 'notebook', 'project', 'params', 'data', 'engines']

### Modules: project

The project module is all about setting the correct working directories in which to run and find your notebooks, Python files and configuration files.   
When the datalabframework is imported, it starts by searching for a `__main__.py` file, according to Python module file naming conventions.   
All module and alias paths are relative to this project root path.
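
The root lookup described above can be pictured as walking up the directory tree until a marker file is found. The following is a minimal sketch of that idea, not the framework's actual implementation; the helper name `find_rootpath` and the directory layout in the demo are hypothetical.

```python
from pathlib import Path
import tempfile

def find_rootpath(start, marker="__main__.py"):
    """Walk upwards from `start` until a directory containing `marker` is found."""
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).is_file():
            return candidate
    return None  # no marker found up to the filesystem root

# Demo on a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "src"
    nested = root / "etl" / "extract"
    nested.mkdir(parents=True)
    (root / "__main__.py").touch()
    found = find_rootpath(nested)

print(found == root)
```

Starting the search from the directory of the running notebook means that notebooks in nested folders resolve to the same root.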

In [6]:
# root path for the project
dlf.project.rootpath()

'/home/jovyan/demo/src'

In [7]:
# this notebook
dlf.project.filename()

'main.ipynb'

In [8]:
# current workdir
dlf.project.workdir()

'/home/jovyan/demo/src'

In [9]:
# profile - as defined in metadata
dlf.project.profile()

'default'

In [10]:
#project info, and current notebook
dlf.project.info()

{'profile': 'default',
 'filename': 'main.ipynb',
 'rootpath': '/home/jovyan/demo/src',
 'workdir': '/home/jovyan/demo/src'}

In [11]:
#all notebooks in this project
dlf.project.notebooks()

['main.ipynb', 'hello.ipynb']

### Modules: Params

Configuration is declared in metadata. Metadata is accumulated starting from the rootpath, and the metadata files found in submodules are merged together into a single configuration.

In [12]:
metadata = dlf.params.metadata()
dlf.utils.pretty_print(metadata)

engines:
  spark-local:
    config:
      jobname: default
      master: local[*]
    context: spark
loggers:
  stream:
    enable: true
    severity: info
profile: default
providers:
  local_filesystem:
    format: csv
    path: ../data
    read:
      options:
        header: true
        inferSchema: true
    service: local
    write:
      options:
        header: true
        mode: overwrite
resources:
  .ascombe:
    path: ascombe.csv
    provider: local_filesystem
  .correlation:
    path: correlation.csv
    provider: local_filesystem
variables: {}
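
One way to picture how several metadata files end up in a single dictionary is a recursive merge. This is only a sketch under that assumption: the function `merge_metadata` and the `.etl.clean` resource are hypothetical, and the framework may resolve conflicts differently.

```python
import copy

def merge_metadata(base, override):
    """Return a new dict where nested dicts are merged and other values are replaced."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_metadata(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged

# Metadata from the project root, as shown in the output above (shortened).
root_md = {
    "profile": "default",
    "resources": {".ascombe": {"path": "ascombe.csv", "provider": "local_filesystem"}},
}
# Metadata from a hypothetical submodule.
sub_md = {
    "resources": {".etl.clean": {"path": "clean.csv", "provider": "local_filesystem"}},
}

merged = merge_metadata(root_md, sub_md)
print(sorted(merged["resources"]))
```

The inputs are left untouched, so each metadata file can still be inspected on its own after merging.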



Data resources are relative to the `rootpath`. Next to the resources we can declare `providers` and `engines`. More about data binding in the next section.

### Modules: Data

Data binding works with the metadata files. It's good practice to declare the actual binding in the metadata and to avoid hardcoding paths in the notebooks and Python source files.

`dlf.data.uri()` provides the URI, a unique identifier for the given resource,   
where `.` denotes how deep in the directory structure the data is located with respect to the project rootpath.

In [13]:
dlf.data.uri('ascombe')

'.ascombe'

In [14]:
dlf.data.metadata('ascombe')

{'resource': {'path': 'ascombe.csv', 'provider': 'local_filesystem'},
 'provider': {'format': 'csv',
  'path': '../data',
  'read': {'options': {'header': True, 'inferSchema': True}},
  'service': 'local',
  'write': {'options': {'header': True, 'mode': 'overwrite'}}}}
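
The resource and provider entries combine into a concrete location. Judging from the paths printed by the engine later in this notebook, the location appears to be the rootpath joined with the provider path and the resource path. The sketch below illustrates that joining with a hypothetical helper `resolve_path`; it is an assumption about the behaviour, not the framework's code.

```python
import posixpath

def resolve_path(rootpath, md):
    """Join rootpath, provider path and resource path, then normalise '..' segments."""
    joined = posixpath.join(rootpath, md["provider"]["path"], md["resource"]["path"])
    return posixpath.normpath(joined)

# Same structure as the output of dlf.data.metadata('ascombe') above (shortened).
md = {
    "resource": {"path": "ascombe.csv", "provider": "local_filesystem"},
    "provider": {"format": "csv", "path": "../data", "service": "local"},
}

print(resolve_path("/home/jovyan/demo/src", md))
# prints /home/jovyan/demo/data/ascombe.csv
```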

### Modules: Notebook

This submodule contains a set of utilities to extract information from notebooks.

In [15]:
dlf.notebook.statistics('hello.ipynb')

{'cells': 8,
 'markdown': 3,
 'code': 5,
 'ename': None,
 'evalue': None,
 'executed': 3}
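
Statistics of this kind can be derived from the notebook file itself, since an `.ipynb` file is JSON with a list of cells. The sketch below counts cells in a notebook dictionary that follows the nbformat v4 layout. The function name `notebook_statistics` is hypothetical and the error fields (`ename`, `evalue`) are left out; this is not the framework's implementation.

```python
def notebook_statistics(nb):
    """Count cells by type in a notebook dict following the nbformat v4 layout."""
    cells = nb.get("cells", [])
    code = [c for c in cells if c.get("cell_type") == "code"]
    markdown = [c for c in cells if c.get("cell_type") == "markdown"]
    # a code cell that has been run carries a non-null execution_count
    executed = [c for c in code if c.get("execution_count") is not None]
    return {
        "cells": len(cells),
        "markdown": len(markdown),
        "code": len(code),
        "executed": len(executed),
    }

# A tiny in-memory notebook (hypothetical content).
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "execution_count": 1, "outputs": [], "source": ["1 + 1"]},
        {"cell_type": "code", "execution_count": None, "outputs": [], "source": ["2 + 2"]},
    ],
    "nbformat": 4,
    "nbformat_minor": 2,
}

print(notebook_statistics(sample))
```

For a file on disk, the dictionary can be obtained with `json.load` from the standard library.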

### Modules: Engines

This submodule allows you to start a context from the configuration described in the metadata. It also provides basic load/store data functions that work with the aliases defined in the configuration.

Let's start by listing the aliases and the configuration of the engines declared in `metadata.yml`.


In [16]:
# get the aliases of the engines

metadata = dlf.params.metadata()
dlf.utils.pretty_print(metadata['engines'])

spark-local:
  config:
    jobname: default
    master: local[*]
  context: spark



__Context: Spark__  
Let's start the engine session by selecting a Spark context from the list. You can have many Spark contexts declared, for instance one for a single node.

In [17]:
import datalabframework as dlf
engine = dlf.engines.get('spark-local')

PYSPARK_SUBMIT_ARGS:  pyspark-shell


You can quickly inspect the properties of the context through the `info` attribute.

In [18]:
engine.info

{'name': 'spark-local',
 'context': 'spark',
 'config': {'jobname': 'default', 'master': 'local[*]'}}

By calling the `context` method, you access the Spark SQL Context directly. The rest of your spark python code is not affected by the initialization of your session with the datalabframework.

In [19]:
spark = engine.context()

#print out name and version
'{}:{}'.format(engine.info['context'], spark.sparkSession.version)

'spark:2.3.1'

Once again, let's read the CSV data, this time using the Spark context: first using the engine `read` utility, then directly using the Spark context and the `dlf.data.path` function to locate our labeled dataset.

In [20]:
#read using the engine utility
df = engine.read('.ascombe')

repartition  None
coalesce  None
cache False
file:///home/natbusa/Projects/datalabframework-demos/demos/minimal/demo/src/../data/ascombe.csv


In [21]:
df.printSchema()

root
 |-- idx: integer (nullable = true)
 |-- Ix: double (nullable = true)
 |-- Iy: double (nullable = true)
 |-- IIx: double (nullable = true)
 |-- IIy: double (nullable = true)
 |-- IIIx: double (nullable = true)
 |-- IIIy: double (nullable = true)
 |-- IVx: double (nullable = true)
 |-- IVy: double (nullable = true)



In [22]:
df.show()

+---+----+-----+----+----+----+-----+----+----+
|idx|  Ix|   Iy| IIx| IIy|IIIx| IIIy| IVx| IVy|
+---+----+-----+----+----+----+-----+----+----+
|  0|10.0| 8.04|10.0|9.14|10.0| 7.46| 8.0|6.58|
|  1| 8.0| 6.95| 8.0|8.14| 8.0| 6.77| 8.0|5.76|
|  2|13.0| 7.58|13.0|8.74|13.0|12.74| 8.0|7.71|
|  3| 9.0| 8.81| 9.0|8.77| 9.0| 7.11| 8.0|8.84|
|  4|11.0| 8.33|11.0|9.26|11.0| 7.81| 8.0|8.47|
|  5|14.0| 9.96|14.0| 8.1|14.0| 8.84| 8.0|7.04|
|  6| 6.0| 7.24| 6.0|6.13| 6.0| 6.08| 8.0|5.25|
|  7| 4.0| 4.26| 4.0| 3.1| 4.0| 5.39|19.0|12.5|
|  8|12.0|10.84|12.0|9.13|12.0| 8.15| 8.0|5.56|
|  9| 7.0| 4.82| 7.0|7.26| 7.0| 6.42| 8.0|7.91|
| 10| 5.0| 5.68| 5.0|4.74| 5.0| 5.73| 8.0|6.89|
+---+----+-----+----+----+----+-----+----+----+



Finally, let's calculate the correlation between the `x` and `y` columns for each of the sets I, II, III and IV, and save the result to a separate dataset.

In [23]:
from pyspark.ml.feature import VectorAssembler

for s in ['I', 'II', 'III', 'IV']:
    va = VectorAssembler(inputCols=[s+'x', s+'y'], outputCol=s)
    df = va.transform(df)
    df = df.drop(s+'x', s+'y')
    
df.show()

+---+------------+-----------+------------+-----------+
|idx|           I|         II|         III|         IV|
+---+------------+-----------+------------+-----------+
|  0| [10.0,8.04]|[10.0,9.14]| [10.0,7.46]| [8.0,6.58]|
|  1|  [8.0,6.95]| [8.0,8.14]|  [8.0,6.77]| [8.0,5.76]|
|  2| [13.0,7.58]|[13.0,8.74]|[13.0,12.74]| [8.0,7.71]|
|  3|  [9.0,8.81]| [9.0,8.77]|  [9.0,7.11]| [8.0,8.84]|
|  4| [11.0,8.33]|[11.0,9.26]| [11.0,7.81]| [8.0,8.47]|
|  5| [14.0,9.96]| [14.0,8.1]| [14.0,8.84]| [8.0,7.04]|
|  6|  [6.0,7.24]| [6.0,6.13]|  [6.0,6.08]| [8.0,5.25]|
|  7|  [4.0,4.26]|  [4.0,3.1]|  [4.0,5.39]|[19.0,12.5]|
|  8|[12.0,10.84]|[12.0,9.13]| [12.0,8.15]| [8.0,5.56]|
|  9|  [7.0,4.82]| [7.0,7.26]|  [7.0,6.42]| [8.0,7.91]|
| 10|  [5.0,5.68]| [5.0,4.74]|  [5.0,5.73]| [8.0,6.89]|
+---+------------+-----------+------------+-----------+



After assembling the dataframe into four sets of 2D vectors, let's calculate the Pearson correlation for each set. In the case of the Anscombe sets (stored here as `ascombe.csv`), all sets should have nearly the same Pearson correlation.

In [24]:
from pyspark.ml.stat import Correlation
from pyspark.sql.types import StructType, StructField, FloatType

corr = {}
cols = ['I', 'II', 'III', 'IV']

# calculate pearson correlations
for s in cols:
    # corr() returns a one-row dataframe holding the 2x2 correlation matrix;
    # element [0,1] is the correlation between x and y
    corr[s] = Correlation.corr(df, s, 'pearson').collect()[0][0][0,1].item()

# declare schema
schema = StructType([StructField(s, FloatType(), True) for s in cols])

# create output dataframe
corr_df = spark.createDataFrame(data=[corr], schema=schema)

In [25]:
import pyspark.sql.functions as f
corr_df.select([f.round(f.avg(c), 3).alias(c) for c in cols]).show()

+-----+-----+-----+-----+
|    I|   II|  III|   IV|
+-----+-----+-----+-----+
|0.816|0.816|0.816|0.817|
+-----+-----+-----+-----+
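
As a cross-check that does not need Spark, the Pearson correlation of set I can be recomputed in plain Python from the `Ix` and `Iy` columns printed earlier. The helper `pearson` is written here for illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Columns Ix and Iy, copied from the df.show() output above.
ix = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
iy = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

print(round(pearson(ix, iy), 3))
# prints 0.816
```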



Save the results. It's a very small data frame. However, when saving files in CSV format, Spark assumes large data sets and partitions the output into several files inside an object (a directory) that carries the name of the target file. See below:


In [26]:
engine.write(corr_df,'correlation')

repartition  None
coalesce  None
cache False
file:///home/natbusa/Projects/datalabframework-demos/demos/minimal/demo/src/../data/correlation.csv
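
To make the directory layout concrete, the sketch below simulates such an output directory and reads its part files back with the standard library. The part file name and the helper `read_partitioned_csv` are illustrative assumptions; real Spark part file names are longer and include a unique identifier.

```python
import csv
import glob
import os
import tempfile

def read_partitioned_csv(directory):
    """Read every `part-*` file in `directory` and return the rows as dicts."""
    rows = []
    for part in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(part, newline="") as handle:
            # with header=true each part file carries its own header line
            rows.extend(csv.DictReader(handle))
    return rows

# Simulated layout: a directory named like the target file, with part files inside.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "correlation.csv")
    os.makedirs(target)
    with open(os.path.join(target, "part-00000.csv"), "w", newline="") as handle:
        handle.write("I,II,III,IV\n0.816,0.816,0.816,0.817\n")
    # Spark also drops an empty marker file when the job succeeds
    open(os.path.join(target, "_SUCCESS"), "w").close()
    rows = read_partitioned_csv(target)

print(rows)
```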


We read it back to check that all went fine.

In [27]:
engine.read('correlation').show()

repartition  None
coalesce  None
cache False
file:///home/natbusa/Projects/datalabframework-demos/demos/minimal/demo/src/../data/correlation.csv
+---------+---------+----------+----------+
|        I|       II|       III|        IV|
+---------+---------+----------+----------+
|0.8164205|0.8162365|0.81628674|0.81652147|
+---------+---------+----------+----------+



### Modules: Export

This submodule allows you to export cells and import them in other notebooks as Python packages. Check the notebook [hello.ipynb](hello.ipynb), where you will see how to export the notebook, then follow the code below to check that it really works!


In [28]:
from hello import hi

importing Jupyter notebook from hello.ipynb


In [29]:
hi()

Hi World!
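
Importing from a notebook can be understood as reading the notebook's JSON, collecting its code cells and executing them inside a fresh module object. The sketch below shows that general technique on an in-memory notebook. It is not the framework's export mechanism: it runs every code cell rather than only the exported ones, it does not handle IPython magics, and the helper `load_notebook_module` is hypothetical.

```python
import json
import types

def load_notebook_module(name, notebook_json):
    """Execute the code cells of a notebook (given as a JSON string) in a new module."""
    nb = json.loads(notebook_json)
    module = types.ModuleType(name)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat stores source as a list of lines
            source = "".join(source)
        exec(compile(source, name, "exec"), module.__dict__)
    return module

# A minimal stand-in for hello.ipynb (hypothetical content).
hello_json = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Hello"]},
        {"cell_type": "code", "execution_count": None, "outputs": [],
         "source": ["def hi():\n", "    return 'Hi World!'\n"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
})

hello = load_notebook_module("hello", hello_json)
print(hello.hi())
# prints Hi World!
```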
