# PyData
November 9, 2015


## The Zen of Data Science

In [None]:
import this

Blaze: SQL for Python?

### The "Triangle Data Stack"

Blaze, Odo, Dask, [Datashape](datashape.pydata.org)

Blaze Server: "Blaze provides uniform access to a variety of common data formats. Blaze Server builds off of this uniform interface to host data remotely through a JSON web API."

Dask: Out-of-Core PyData

*   Lower level
*   A parallel computing framework
*   Simple library to enable parallelism
*   collections $\rightarrow$ graphs $\rightarrow$ schedulers
*   returns a "promise" (a lazy result); if the result doesn't fit in memory, it will be handled some other way
*   parallel cross-validation
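The collections $\rightarrow$ graphs $\rightarrow$ schedulers idea can be sketched without Dask itself. The toy below, in plain Python, stores a computation as a dictionary that maps keys to either values or `(function, arg, ...)` tuples, then walks it with a naive recursive scheduler. The names `graph` and `toy_get` are hypothetical, and this only illustrates the concept; it is not Dask's actual scheduler.

```python
from operator import add, mul

# A task graph: each key maps to a literal value or a (function, *args) task.
# Arguments that are keys of the graph refer to other tasks' results.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),      # x + y
    "scaled": (mul, "sum", 10),  # (x + y) * 10
}


def toy_get(graph, key, cache=None):
    """Naive single-threaded scheduler: resolve `key` by recursing on dependencies."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Replace arguments that name other tasks with their computed results.
        resolved = [
            toy_get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task
    cache[key] = result
    return result


result = toy_get(graph, "scaled")
```

Because the graph is plain data, a smarter scheduler could inspect it and run independent tasks in parallel or out of core, which is the part the notes attribute to Dask.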

Spark

[Distributed](https://media.readthedocs.org/pdf/distributed/latest/distributed.pdf)

conda over docker?

### Check Out

*   Check out biopython, bokeh, scikit-image
*   [Anaconda Cloud](anaconda.org)
*   [Anaconda Cluster](http://docs.continuum.io/anaconda-cluster/index)
*   [Blaze](blaze.pydata.org)
*   [Numba](numba.pydata.org): Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.


## Python in Astronomy

### The Hubble Space Telescope

Image correction for spherical aberration was no match for actually going up to correct the mirror

### Software

*   Calibration
*   Combine / reduce data
*   analyse data
*   simulate observations
*   plan observations

Used Fortran, then IRAF (~Java $\because$ portable), but needed custom language (C $\times$ Fortran $\rightarrow$ SPP). Not great.

Others wanted to link to other languages, but IRAF didn't support it. Couldn't evolve.

New alternate user interface: PyRAF

Small goals at first, but power of python blew them away (o.o) and their goals got bigger

python array $\rightarrow$ numarray $\rightarrow$ numpy

matplotlib

PyFITS

### Applications

*   Astrodrizzle
    *   Combine different exposures taken from different pointings, and also reject cosmic rays
    *   0.003 pixel error noticeable
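A rough feel for why combining exposures lets you reject cosmic rays can be had from a median stack in NumPy. This is a swapped-in, much simpler technique than what Astrodrizzle does (there is no resampling or alignment here), and the image size, noise level, and hit counts are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" sky: a flat background with one bright source.
sky = np.full((32, 32), 10.0)
sky[16, 16] = 100.0

# Assumption for the toy: each pixel is hit by a cosmic ray in at most one
# exposure, so the hit pixels are drawn without replacement across all frames.
hits = rng.permutation(sky.size)[:50].reshape(5, 10)

exposures = []
for frame_hits in hits:
    # An already-aligned exposure with unit-variance noise.
    frame = sky + rng.normal(scale=1.0, size=sky.shape)
    rows, cols = np.unravel_index(frame_hits, sky.shape)
    frame[rows, cols] += 1000.0  # cosmic rays: huge values in a few pixels
    exposures.append(frame)
stack = np.stack(exposures)

mean_image = stack.mean(axis=0)          # cosmic rays leak into the mean
median_image = np.median(stack, axis=0)  # the median ignores isolated outliers
```

Comparing `mean_image.max()` with `median_image.max()` shows the cosmic-ray hits surviving in the mean but not in the median.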

### James Webb Space Telescope

IR telescope, ~40 K, L2 point

Needed to record how structure would perform once cooled down to 40K, recorded data using interferometers, analysis and visualisation in python

Python community has better culture?

Astropy!


## Memex: Mining the Deep Web

Surface web: search engine

Deep web: un-crawled space

Started by DARPA

Want to basically be able to search beyond current search capabilities of web crawlers

e.g. want to catch Dark Web criminals, e.g. founder of Silk Road, human trafficking, drugs, weapons, child exploitation, assassin for hire, etc.

### Large Scale Data Analytics

business intelligence / databases

DM/Stats/ML

Scientific Computing

Distributed Systems

Cross links: mahout, RHadoop

### Analytics Pipeline

Scrape, extract, index (could be dynamic content, images, etc.), visualise, search applications

**Memex explorer**: Django web app, elasticsearch index, bokeh visualisations, kibana dash. Apache Nutch, NYU ACHE crawlers, NYU's domain discovery tool

CMU time anomaly detection tool (analyse time-series data, e.g. finance data), Sotera Datawake (follows a subject expert as they do investigation on the web)

Apache Tika: a content analysis tool

ACHE crawler: naive bayes classifier???

Data is passed to law enforcement agencies, open investigations

## Panel

Julia, R, Python

Neural networks overhyped, not much breakthrough, can't analyse what happens within

Spark overhyped? Don't use for machine learning

Null hypotheses outdated, use cross validation?

Use p-values in A/B testing, so that you know when to stop running the experiment to get statistically significant results
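As a minimal sketch of what such a p-value could look like, the function below runs a pooled two-proportion z-test using only the standard library. The function name and the conversion counts are made up, and the test assumes the sample sizes were fixed in advance.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up counts, for illustration only.
p_same = two_proportion_p_value(100, 1000, 100, 1000)  # identical rates
p_diff = two_proportion_p_value(100, 1000, 150, 1000)  # 10% vs 15%
```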

WAIC to avoid overfitting?

New result: impossibility theorem for clustering (like Arrow's impossibility theorem but for clustering?). Need to think a lot about initialisation, use spectral methods, consensus algorithms

additive models, non-negative matrices???

Julia vs R vs Python: JSON as universal "glue"

blackboxing

matplotlib vs d3 vs etc.

NumFocus

new software: gradient boosted tree with elastic net

machine teaching?


## New Features

### matplotlib

v1.5.0 came out 2 weeks ago

*   Q4 2015 2.0 style changes
*   Q4 2016 2.1 regular feature release

*   interactive front end for mpl in ipython notebook
    *   full mouse/key events back into python layer
    *   renders animations
*   a ton of styles available now
    *   can fetch styles via http request
*   auto updating without re-running kernel
*   string labels to bar plots (add emojis to your plots!)
*   style cycler like in ggplot2
*   labelled data support (i.e. import as "data frame", plot instantly with column labels)
*   autowrapping of long text
*   contour corner masking
*   smooth colorbars, legend support, etc.
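A small sketch of two items from the list above, styles and labelled data support, assuming a matplotlib version that accepts the `data` keyword (1.5 or later). The column names and values are hypothetical; a pandas DataFrame could be passed in place of the dict.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Hypothetical labelled data.
data = {"day": [1, 2, 3, 4], "temperature": [12.0, 15.5, 14.0, 17.2]}

with plt.style.context("ggplot"):  # one of the bundled styles
    fig, ax = plt.subplots()
    # Column names are looked up in `data` instead of passing arrays directly.
    (line,) = ax.plot("day", "temperature", data=data)
    ax.set_xlabel("day")
    ax.set_ylabel("temperature")
```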

### pandas

more useful string and dict-like selections

select_dtypes

categoricals (pseudo-dtype)

timedeltas now 1st class type/objects?

dt accessors

assign & pipe

sample
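Several of the features noted so far can be tried on a toy frame. The sketch below assumes a reasonably recent pandas; the column names, values, and the helper `top_scores` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame for illustration.
df = pd.DataFrame({
    "name": ["ada", "bob", "cy", "dee"],
    "score": [3.5, 2.0, 4.0, 1.5],
    "when": pd.to_datetime(
        ["2015-11-09", "2015-11-10", "2015-11-11", "2015-11-12"]
    ),
    "grade": pd.Categorical(["b", "c", "a", "c"]),
})

numeric = df.select_dtypes(include=["number"])  # only the numeric columns
weekdays = df["when"].dt.dayofweek              # .dt accessor on datetimes
elapsed = df["when"] - df["when"].min()         # a timedelta Series


def top_scores(frame, threshold):
    """Helper used with .pipe: keep rows whose score exceeds the threshold."""
    return frame[frame["score"] > threshold]


result = (
    df.assign(doubled=lambda d: d["score"] * 2)  # add a derived column
      .pipe(top_scores, threshold=3.0)           # chain an ordinary function
)
subset = df.sample(n=2, random_state=0)          # reproducible random rows
```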

release the gil

.groupby = free performance!

.plot accessor

big api change: .sort_values()

*   every object has sort_index() or sort_values()
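A short sketch of the sorting API on a made-up frame:

```python
import pandas as pd

# Hypothetical data with an unsorted index.
df = pd.DataFrame({"score": [3.5, 2.0, 4.0]}, index=["c", "a", "b"])

by_value = df.sort_values("score", ascending=False)  # order rows by a column
by_index = df.sort_index()                           # order rows by the index
series_sorted = df["score"].sort_values()            # same method on a Series
```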

more dtypes in the future

Int NA, numba, dask support

df .style

### scikit-learn

#### 0.17

released last thurs

LDA (Latent Dirichlet Allocation) using online variational inference

SAG for logistic, ridge regression

coordinate descent solver for non-negative matrix factorisation (speed boost)

Barnes-Hut approximation for t-SNE manifold learning

FunctionTransformer (basically a map)

VotingClassifier

Scalers for preprocessing: RobustScaler, MaxAbsScaler
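The new transformers and the voting ensemble might be combined roughly as follows. This is a sketch on synthetic data; the choice of base estimators and the log transform are arbitrary assumptions, not something from the talk.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler, RobustScaler

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# FunctionTransformer wraps a plain function so it can sit in a pipeline.
log_abs = FunctionTransformer(lambda a: np.log1p(np.abs(a)))
X_log = log_abs.fit_transform(X)

# MaxAbsScaler divides each feature by its maximum absolute value.
X_maxabs = MaxAbsScaler().fit_transform(X)

# RobustScaler centers on the median and scales by the interquartile range.
X_robust = RobustScaler().fit_transform(X)

# VotingClassifier combines several estimators by majority ("hard") vote.
ensemble = make_pipeline(
    RobustScaler(),
    VotingClassifier(
        estimators=[("lr", LogisticRegression()), ("nb", GaussianNB())],
        voting="hard",
    ),
)
ensemble.fit(X, y)
accuracy = ensemble.score(X, y)  # training accuracy, not a held-out estimate
```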

add backlinks to docs

#### 0.18

gaussian process rewrite

*   a whole lot of kernels

new gradient based unsupervised stuff

neural networks

better cross validation

faster pca

### jupyter

new website

ipython 4.0

the command to start the notebook is no longer

In [None]:
ipython notebook

but

In [None]:
jupyter notebook

live rendering of jupyter notebooks in github

jupyter-incubator

*   lightweight process to start subprojects

multiple selections, multiple copy/pastes of cells

j/k to navigate between cells

multi-cell find and replace, regex supported

command palette (cmd-shift-p)

restart and run-all

#### 5.0

jupyter workbench: text editor, file browser, terminal, etc. in a single page.

entire thing will be npm packages

js apis

## Lightning Talks

### Bokeh

rich interaction

check out tonyfast's github

### don't use k-means

don't use k-means for (exploratory) clustering; it partitions!

assumes gaussian balls

don't know number of clusters

how then? don't be wrong. don't need to be completely right, just don't be wrong

need stable clusterings, "well-posedness"

*   instead use
    *   DBSCAN
    *   robust single linkage
    *   HDBSCAN (single parameter)
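The point about k-means partitioning and assuming "gaussian balls" can be seen on a toy dataset. The sketch below uses scikit-learn's `KMeans` and `DBSCAN` on two interleaved half-moons; the `eps` and `min_samples` values are hand-picked for this data, and HDBSCAN is left out only to keep the dependencies to scikit-learn.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: clearly two groups, but not gaussian balls.
X, truth = make_moons(n_samples=200, noise=0.05, random_state=0)

# k-means must be told the number of clusters and partitions every point.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters from density; label -1 marks noise.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Agreement with the known grouping (1.0 means a perfect match).
kmeans_ari = adjusted_rand_score(truth, kmeans_labels)
dbscan_ari = adjusted_rand_score(truth, dbscan_labels)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})
```

Comparing `kmeans_ari` with `dbscan_ari` shows how much better the density-based result matches the known grouping on this shape of data.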

### plotting

oversaturation in plots of large datasets?

[abstract rendering](http://bokeh.pydata.org/en/0.8.1/docs/user_guide/ar.html)

data -> aggregate -> image

map individuals to pixels, log the data
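The data, aggregate, image pipeline can be approximated in plain NumPy: bin the points into a pixel-sized grid, then log-scale the counts. This is only a sketch of the idea with made-up random data, not Bokeh's abstract rendering API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical points; drawn one marker each, they would oversaturate a plot.
n_points = 100_000
x = rng.normal(size=n_points)
y = rng.normal(size=n_points)

# data -> aggregate: count how many points land in each pixel-sized bin.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(300, 300))

# aggregate -> image: log-scale so dense and sparse regions both stay visible.
image = np.log1p(counts)
```

The resulting array could then be displayed with something like `plt.imshow(image.T, origin="lower")`.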