# Welcome to the Dark Art of Coding:
## Introduction to Python

PyData NYC 2017 Overview

<img src='./logos.3.600.wide.png' height='250' width='300' style="float:right">

Agenda:

* notebooks: beakerx, jupyter lab
* pandas: future of pandas, head() to tail(), idiomatic python, datetimes
* machine learning: scikit-learn on top of pandas, shogun ml
* pyspark: arrow, feather
* visualization: interactive matplotlib
* numfocus: state of numfocus
* git: using git metadata to predict code bug risk


# Notebooks
---

## `beakerx`

* what is beaker?
* what is beakerx?
* how are they different?

beaker was a standalone, jupyter-style notebook backed by a company called Two Sigma.
beakerx is a port of beaker to a set of jupyter notebook extensions, rather than a standalone product.
beakerx adds significant functionality to jupyter notebooks, so you can do things like:

* **polyglot**: run multiple computing languages IN THE SAME notebook
* **enhances pandas** dataframe capabilities
* **quick publish** directly from the notebook
* **coming now-ish**: integrate with jupyter lab (more on this later) (alpha quality)

In [1]:
# This is also an initialization cell

import time
start_time = time.time()
time.sleep(5)
print("Elapsed Time:", time.time() - start_time)

Elapsed Time: 5.002179861068726


In [4]:
# This is NOT an initialization cell
print("During initialization, nothing should have happened")

During initialization, nothing should have happened


In [2]:
# This is also an initialization cell

import time
start_time = time.time()
time.sleep(5)
print("Elapsed Time:", time.time() - start_time)

Elapsed Time: 5.002023935317993


In [3]:
# This is an initialization cell

import pandas as pd
import numpy as np
import sklearn

In [10]:
b = pd.DataFrame({'alpha':   list(range(1, 1001)),
                  'beta':    list(range(1001, 3000, 2)),     # skip by twos
                  'gamma':   list(range(1, 11)) * 100,       # repeating sequence
                  'delta':   list(range(10, 110, 10)) * 100,
                  'epsilon': list(range(1, 11)) * 100,
   'zeta_long_name_column':  list(range(1, 11)) * 100,
    'eta_long_name_column':  list(range(1, 11)) * 100,
  'theta_long_name_column':  list(range(1, 11)) * 100,
   'iota_long_name_column':  list(range(1, 11)) * 100,
  'kappa_long_name_column':  list(range(1, 11)) * 100,
 'lambda_long_name_column':  list(range(1, 11)) * 100,
                      'mu':  list(range(1, 11)) * 100,})

b             

In [6]:
import beakerx.runtime

In [11]:
# data bars
# heat map
# filter
# reorganize

b

### Python to Javascript sample

http://nbviewer.jupyter.org/github/twosigma/beakerx/blob/master/doc/python/autotranslation_python.ipynb

### download it

```bash
conda install -c conda-forge ipywidgets beakerx
```

## jupyter lab

* what is jupyter?
* what is jupyter lab?

  * notebook and server interface
  * tabs (and panels)
  * renderers (json, csv) > panel interface & efficient renderers
  * real-time collaboration
  * file browser
  * and more

see the video:
https://www.youtube.com/watch?v=dSjvK-Z3o3U

A fun demo of interactive datamodels within Jupyter Lab
https://youtu.be/dSjvK-Z3o3U?t=666

* nothing is special > everything is a plugin > just about everything is customizable...

# pandas
---

## future of pandas

* pandas2 is coming
* key goals:

   * Fixing long-standing limitations or inconsistencies in missing data: null values in integer and boolean data, and a more consistent notion of null / NA.
   * Improved performance and utilization of multicore systems
   * Better user control / visibility of memory usage (which can be opaque and difficult to control)
   * Clearer semantics around non-NumPy data types, and permitting new pandas-only data types to be added
   * Exposing a "libpandas" C/C++ API to other Python library developers: the internals of Series and DataFrame are only weakly accessible in other developers' native code. This has been a limitation for scikit-learn and other projects requiring C or Cython-level access to pandas object data.
   * Removal of deprecated functionality

https://github.com/pandas-dev/pandas2/blob/master/source/goals.rst
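
The first goal (nulls in integer data) is easy to see with today's NumPy-backed defaults. The following is a minimal sketch, assuming a standard pandas install; the variable names are made up for illustration. It shows an integer column silently becoming floating point once a single value is missing:

```python
import pandas as pd

# An all-integer column keeps an integer dtype.
ints = pd.Series([1, 2, 3])

# A single missing value forces the default (NumPy-backed) representation
# to fall back to floating point, because NaN is itself a float.
ints_with_null = pd.Series([1, 2, None])

print(ints.dtype)                   # an integer dtype
print(ints_with_null.dtype)         # a floating-point dtype
print(ints_with_null.isna().sum())  # number of missing values
```

The values are still usable, but the column no longer says "integer", which is the kind of inconsistency the pandas2 goals describe.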

## pandas talks to keep your eye on:


* head() to tail()

* idiomatic pandas

* datetimes in pandas <<< `disclaimer, I gave this talk`

    
    
    

# machine learning
---

* scikit-learn on top of pandas >> scikit-learn is built on numpy arrays and thus can't use many pandas features (e.g., column labels, categorical dtypes)
* shogun ml
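
To make the numpy limitation concrete, here is a small sketch of what is lost when a DataFrame is handed over as a plain array. It uses only pandas and numpy (no scikit-learn call is shown), and the column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

# A DataFrame with one numeric column and one categorical column.
df = pd.DataFrame({
    "height_cm": [170.0, 182.5, 165.2],
    "city": pd.Categorical(["NYC", "Boston", "NYC"]),
})

# Converting to a numpy array is roughly what array-based libraries see.
arr = df.to_numpy()

print(df.dtypes.to_dict())      # one dtype per column
print(arr.dtype)                # a single dtype for the whole array
print(hasattr(arr, "columns"))  # the column labels do not survive
```

The per-column dtypes and the labels live on the DataFrame, not on the array, so anything downstream that only receives `arr` has to be told about them separately.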

# pyspark
---

Data portability is a major issue between systems > a proposed solution is **ARROW**

## The problems

<img src='./copy2.png' height='300' width='400' style="float:center">

* Each system has its own internal memory format
* 70-80% of computation is wasted on serialization and deserialization
* Similar functionality implemented in multiple projects

<img src='./simd.png' height='400' width='500' style="float:center">

* memory formats are often not well designed for modern CPUs (e.g., SIMD vectorization)

## A proposed solution
<img src='./shared2.png' height='300' width='400' style="float:center">

* All systems utilize the same memory format
* No overhead for cross-system communication
* Projects can share functionality (e.g., Parquet-to-Arrow reader)

* columnar formats should speed up operations like:
    * sorting
    * searching
    * aggregations (sum, mean, etc)
    * groupings
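
As a rough illustration of why a columnar layout helps with aggregations, the sketch below stores the same data row-wise and column-wise using only the Python standard library. This is a toy model of the idea, not how Arrow is implemented, and the names `rows` and `columns` are made up for the example:

```python
from array import array

# Row-oriented layout: one record (dict) per row.
rows = [{"alpha": i, "beta": 2 * i} for i in range(1, 1001)]

# Column-oriented layout: one contiguous, typed array per column.
columns = {
    "alpha": array("q", (r["alpha"] for r in rows)),
    "beta": array("q", (r["beta"] for r in rows)),
}

# Aggregating one column in the row layout touches every record...
row_total = sum(r["alpha"] for r in rows)

# ...while the columnar layout scans a single contiguous buffer.
col_total = sum(columns["alpha"])

print(row_total, col_total)
```

Both totals are the same; the difference is that the columnar version only reads the values it needs, which is what makes sorting, searching, and aggregation friendlier to the hardware.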


## There's more:

* **ARROW** >> in memory data structure
* **FEATHER** >> on disk data structure

See: https://arrow.apache.org/ 

# visualization
---

* interactive matplotlib

# numfocus
---

* state of numfocus

DONATE: 
https://www.numfocus.org/open-source-projects/

# git fun
---

* using git metadata to predict code bug risk

# Getting more
---

Keep your eye on the pydata website
https://pydata.org/events.html

Keep your eye on the pydata youtube channel
https://www.youtube.com/user/PyDataTV
    