# Datalabframework

Minimal example and directory structure.

## Elements

This scaffolding works with three elements, which are co-ordinated with each other:

  - the python notebook you are reading (main.ipynb)
  - the datalabframework python package (datalabframework)
  - configuration files (metadata.yml, `__main__.py`, Makefile)

## Principles

 - **Both notebooks and code are first-class citizens**

In the source directory `src` you will find all source code. Both notebooks and code files are treated as source files. Source code is further partitioned into several directories to organize the data science project. Following Python package conventions, the root of the project is tagged by a `__main__.py` file, and directories contain an `__init__.py` file. This way, Python and notebook files can reference each other.

Python notebooks and Python code can be mixed and matched, and are interoperable. You can import functions from a notebook into Python code, and you can import Python files into a notebook.
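
As an illustration of how code living in a notebook can be reused from plain Python, here is a minimal sketch that runs the code cells of a notebook inside a fresh module object. It relies only on the standard library and the nbformat 4 JSON layout; the helper name `load_notebook_module` and the in-memory demo notebook are assumptions made for this example, and this is not necessarily how the datalabframework does it.

```python
import json
import types

def load_notebook_module(notebook_json, name="notebook_module"):
    """Run the code cells of a notebook (nbformat 4 JSON text) in a fresh module."""
    nb = json.loads(notebook_json)
    module = types.ModuleType(name)
    for cell in nb["cells"]:
        if cell["cell_type"] != "code":
            continue
        # In nbformat 4, `source` is either a string or a list of lines.
        source = cell["source"]
        if isinstance(source, list):
            source = "".join(source)
        exec(source, module.__dict__)
    return module

# A tiny in-memory notebook standing in for a real .ipynb file (hypothetical content).
demo_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# helpers"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["def double(x):\n", "    return 2 * x\n"]},
    ],
})

helpers = load_notebook_module(demo_notebook)
result = helpers.double(21)
print(result)
```

Cells that contain IPython magics or shell escapes are not valid Python and would need to be filtered out before calling `exec`.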

 - **Data Directories should not contain logic code**

Data can be located anywhere: on remote HDFS clusters, on object store services exposed via the S3 protocol, and so on. However, it is generally good practice to keep some data (or all of it, if possible) locally on the file system.

Data, configuration and code are separated by moving all configuration to metadata files. Metadata files make it possible to define aliases for data resources, data services and Spark configurations, which keeps the Spark code tidy with no hardcoded parameters.

 - **Decouple Code from Configuration**

Notebooks and code should be decoupled from both engine configurations and data locations. All configuration is kept in `metadata.yml` YAML files. Multiple setups for test, exploration and production can be described in the same `metadata.yml` file or in separate files.

 - **Declarative Configuration**

Metadata files are responsible for binding data and engine configurations to the code. For instance, all data in the code should be referenced by an alias, and storage and retrieval of data objects and files should happen via a common API. The metadata YAML file describes the providers for each data source as well as the mapping of data aliases to their corresponding data objects.
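
To make the idea of alias binding concrete, here is a minimal sketch in plain Python. Only the top-level keys `providers` and `resources` appear in the metadata shown in this notebook; the nested fields (`service`, `rootpath`, `provider`, `path`, `format`) and the `resolve` helper are hypothetical and chosen purely for illustration.

```python
# Hypothetical metadata layout: only the top-level keys `providers` and
# `resources` come from the framework; the nested fields are assumptions.
metadata = {
    "providers": {
        "local": {"service": "fs", "rootpath": "data"},
    },
    "resources": {
        "train": {"provider": "local", "path": "train.csv", "format": "csv"},
    },
}

def resolve(alias, md):
    """Return the location and format bound to a data alias."""
    resource = md["resources"][alias]
    provider = md["providers"][resource["provider"]]
    location = "{}/{}".format(provider["rootpath"].rstrip("/"), resource["path"])
    return {"url": location, "format": resource["format"]}

# The code only mentions the alias; the location comes from the metadata.
binding = resolve("train", metadata)
print(binding)
```

With this kind of indirection, moving the dataset to another provider means editing the metadata, while the code that asks for `train` stays the same.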



## Project Template

The data science project is structured to facilitate the deployment of artifacts and to make it easy to switch from batch processing to live experimentation. The top-level project is composed of the following items:

### Top level Structure

In [1]:
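import os

def tree(startpath):
    # walk the directory tree and print each directory followed by its files
    for root, dirs, files in os.walk(startpath):
        # depth relative to startpath: 0 for the start directory itself
        rel = os.path.relpath(root, startpath)
        level = 0 if rel == os.curdir else rel.count(os.sep) + 1
        print('{}{}/'.format(' ' * 2 * (level - 1) + '├─  ', os.path.basename(root)))
        for f in files:
            print('{}{}'.format('  ' * 2 * level + '├─  ', f))

tree('.')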

├─  ./
├─  main.ipynb
├─  metadata.yml
├─  __main__.py
├─  .ipynb_checkpoints/
    ├─  main-checkpoint.ipynb


## Data Lab Framework

In [2]:
import datalabframework as dlf

### Package things
The package version is exposed through the package variables `version_info` and `__version__`.

In [3]:
dlf.version_info

(0, 5, 9)

In [4]:
dlf.__version__

'0.5.9'

Check if the datalabframework is loaded in the current Python context.

In [5]:
try:
    __DATALABFRAMEWORK__
    print("the datalabframework is loaded")
except NameError:
    print("the datalabframework is not loaded")

the datalabframework is loaded


In [6]:
# list of modules loaded by `from datalabframework import *`
dlf.__all__

['logging', 'notebook', 'project', 'params', 'data', 'engines']

### Modules: Project

The project module is all about setting the correct working directories in which to run and find your notebooks, Python files and configuration files.  
When the datalabframework is imported, it starts by searching for a `__main__.py` file, according to Python module file naming conventions.  
All module and alias paths are relative to this project root path.
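
The root lookup described above can be sketched as follows. This is a minimal illustration using only the standard library: it assumes the search walks upward from a starting directory until it finds the marker file, which may differ from the package's actual strategy. The function name `find_rootpath` is hypothetical.

```python
import os
import tempfile

def find_rootpath(start, marker="__main__.py"):
    """Walk up from `start` until a directory containing `marker` is found."""
    current = os.path.realpath(start)
    while True:
        if os.path.isfile(os.path.join(current, marker)):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root without a match
            return None
        current = parent

# Build a throwaway project: <tmp>/__main__.py and <tmp>/etl/extract/
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "__main__.py"), "w").close()
    nested = os.path.join(tmp, "etl", "extract")
    os.makedirs(nested)
    found = find_rootpath(nested)
    expected = os.path.realpath(tmp)

print(found == expected)
```

Anchoring every path to a single marker file means a notebook can be moved to a deeper directory without changing how it refers to other modules or data aliases.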

In [7]:
# root path for the project
dlf.project.rootpath()

'/home/jovyan/work/demos/minimal/demo/src'

In [8]:
# this notebook
dlf.project.filename()

'main.ipynb'

In [9]:
# current workdir
dlf.project.workdir()

'/home/jovyan/work/demos/minimal/demo/src'

In [10]:
# profile - as defined in metadata
dlf.project.profile()

'default'

In [11]:
#project info, and current notebook
dlf.project.info()

{'profile': 'default',
 'filename': 'main.ipynb',
 'rootpath': '/home/jovyan/work/demos/minimal/demo/src',
 'workdir': '/home/jovyan/work/demos/minimal/demo/src'}

In [12]:
#all notebooks in this project
dlf.project.notebooks()

['main.ipynb']

In [13]:
# git info for the project, if available
dlf.project.repository()

{'type': 'git',
 'committer': 'natbusa',
 'hash': 'f8fac5e',
 'commit': 'f8fac5e9a4fb03aae4bf76606fdcecfa20fb80f3',
 'branch': 'master',
 'url': 'https://github.com/natbusa/datalabframework-demos.git',
 'name': 'datalabframework-demos.git',
 'date': '2018-10-17T08:32:56+07:00',
 'clean': False}

In [14]:
dlf.project.username()

'jovyan'

### Modules: Params

Configuration is declared in metadata. Metadata is accumulated starting from the rootpath, and metadata files found in submodules are merged together.

In [15]:
metadata = dlf.params.metadata()
dlf.utils.pretty_print(metadata)

engines:
  spark-local:
    config:
      master: local[*]
    context: spark
loggers:
  stream:
    enable: true
    severity: info
profile: default
providers: {}
resources: {}
variables: {}



Data resources are relative to the `rootpath`. Next to the resources we can declare `providers` and `engines`. More about data binding in the next section.
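
One way to picture how metadata from several files could be accumulated is a recursive dictionary merge. The sketch below is an assumption about the general technique and not the package's actual implementation; `merge` and `submodule_md` are hypothetical names, while `root_md` mirrors part of the metadata printed above.

```python
def merge(base, override):
    """Recursively merge `override` into a copy of `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are mappings: merge them key by key
            merged[key] = merge(merged[key], value)
        else:
            # otherwise the later value wins
            merged[key] = value
    return merged

root_md = {
    "profile": "default",
    "engines": {"spark-local": {"context": "spark", "config": {"master": "local[*]"}}},
    "resources": {},
}
# Hypothetical metadata contributed by a submodule directory.
submodule_md = {"resources": {"train": {"path": "data/train.csv"}}}

combined = merge(root_md, submodule_md)
print(combined["resources"])
```

The inputs are left unmodified, so each metadata file can still be inspected on its own after merging.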

### Modules: Notebook

This submodule contains a set of utilities to extract information from notebooks.

In [16]:
dlf.notebook.statistics('main.ipynb')

{'cells': 46,
 'markdown': 19,
 'code': 27,
 'ename': None,
 'evalue': None,
 'executed': 26}
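
Counts like the ones above can be derived directly from the notebook file, since an `.ipynb` file is a JSON document with a `cells` list. The following sketch works on an in-memory dictionary in the nbformat 4 layout; the function name `notebook_statistics` and the rule used for `executed` (a code cell with a non-null `execution_count`) are assumptions and may not match what `dlf.notebook.statistics` does internally.

```python
def notebook_statistics(nb):
    """Count cells by type in a notebook dict (nbformat 4 layout)."""
    cells = nb.get("cells", [])
    code = [c for c in cells if c["cell_type"] == "code"]
    markdown = [c for c in cells if c["cell_type"] == "markdown"]
    # Assumption: a code cell counts as executed once it has an execution count.
    executed = [c for c in code if c.get("execution_count") is not None]
    return {"cells": len(cells), "markdown": len(markdown),
            "code": len(code), "executed": len(executed)}

# Hypothetical three-cell notebook: one markdown cell, two code cells.
demo = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "outputs": [], "source": ["x = 1"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["y = 2"]},
    ],
}

stats = notebook_statistics(demo)
print(stats)
```

For a notebook on disk, the same dictionary can be obtained by loading the file with `json.load`.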

### Modules: Engines

This submodule allows you to start a context from the configuration described in the metadata. It also provides basic load/store data functions that work with the aliases defined in the configuration.

Let's start by listing the aliases and the configuration of the engines declared in `metadata.yml`.


In [17]:
# get the aliases of the engines

metadata = dlf.params.metadata()
dlf.utils.pretty_print(metadata['engines'])

spark-local:
  config:
    master: local[*]
  context: spark



__Context: Spark__  
Let's start the engine session by selecting a Spark context from the list. You can declare many Spark contexts, for instance one for a single node.

In [18]:
import datalabframework as dlf
engine = dlf.engines.get('spark-local')

PYSPARK_SUBMIT_ARGS:  pyspark-shell


You can quickly inspect the properties of the context by reading the `info` attribute.

In [19]:
engine.info

{'name': 'spark-local', 'context': 'spark', 'config': {'master': 'local[*]'}}

By calling the `context` method, you access the Spark SQL Context directly. The rest of your spark python code is not affected by the initialization of your session with the datalabframework.

In [20]:
spark = engine.context()

#print out name and version
'{}:{}'.format(engine.info['context'], spark.sparkSession.version)

'spark:2.3.2'

Next, let's generate a small dataset of random integers and turn it into a Spark dataframe using the Spark context.

In [21]:
import random 
random.seed(1234)

values = [random.randint(0,10) for _ in range(1000)]

In [22]:
from pyspark.sql.types import IntegerType

cSchema = IntegerType()

# a plain list of ints with an atomic schema yields a single column named `value`
df = spark.createDataFrame(values, schema=cSchema)


In [23]:
df.show(5)

+-----+
|value|
+-----+
|    7|
|    1|
|    0|
|    1|
|    9|
+-----+
only showing top 5 rows



Let's do some basic Spark processing on this simple dataframe.

In [24]:
df.summary().show()

+-------+-----------------+
|summary|            value|
+-------+-----------------+
|  count|             1000|
|   mean|            5.235|
| stddev|3.157490260435453|
|    min|                0|
|    25%|                2|
|    50%|                5|
|    75%|                8|
|    max|               10|
+-------+-----------------+
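
The `count`, `mean`, `stddev`, `min` and `max` rows can be cross-checked without Spark. The sketch below regenerates the values with the same seed and uses the standard library `statistics` module; `statistics.stdev` is the sample standard deviation, which is also what Spark's `stddev` computes. The numbers should agree with the table only if the random sequence is the same in your Python version, so no expected output is shown. The quartile rows are left out because Spark reports approximate percentiles.

```python
import random
import statistics

random.seed(1234)
values = [random.randint(0, 10) for _ in range(1000)]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    # sample standard deviation (n - 1 in the denominator)
    "stddev": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print("{:>6}: {}".format(name, value))
```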



### Modules: Logging

This submodule contains a set of utilities to log results.

In [25]:
logger = dlf.logging.getLogger()

In [26]:
logger.info('hello')

2018-10-17 06:23:53,415 - INFO - f8fac5e - datalabframework-demos.git - jovyan - main.ipynb - message - hello
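
A log line with the same shape can be produced with the standard library alone, which may help when reading or parsing these logs. This is only a sketch of the general technique (a `logging.LoggerAdapter` that injects context fields into every record); the field names `hash`, `repo`, `user` and `notebook`, the logger name `dlf_sketch`, and the hardcoded context values are assumptions for the example, not the framework's internals.

```python
import io
import logging

# Hypothetical context values, copied from the outputs shown earlier.
context = {"hash": "f8fac5e", "repo": "datalabframework-demos.git",
           "user": "jovyan", "notebook": "main.ipynb"}

stream = io.StringIO()  # capture the output so it can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s - %(levelname)s - %(hash)s - %(repo)s - %(user)s"
    " - %(notebook)s - message - %(message)s"))

base = logging.getLogger("dlf_sketch")
base.setLevel(logging.INFO)
base.addHandler(handler)
base.propagate = False  # keep the demo output out of the root logger

# LoggerAdapter passes the context to every record as extra fields.
sketch_logger = logging.LoggerAdapter(base, context)
sketch_logger.info("hello")

line = stream.getvalue().strip()
print(line)
```

The key `notebook` is used instead of `filename` because `filename` is a reserved attribute of `logging.LogRecord` and cannot be overwritten through extra fields.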
