# Day 2

## Density-based clustering in Python

*   k-means not always the best method
    *   default, but not best
    *   simple, resources available, implementations available, scalable
    *   assumes hyperspheres, may be hard to choose k
*   Use density-based clustering
    *   unknown underlying PDF
    *   premise: estimate PDF
    *   want to: threshold PDF and take **upper level set** and find the **connected components** of the upper level set
    *   intersect datapoints with clusters, label using clusters
    *   **pros** complex cluster shapes, don't need to know k, automatically find outliers
    *   **cons** not as scalable, need distance metric, connected components computation expensive, not great for high dimensional data
    *   popular: DBSCAN
        *   partition data into core, boundary, noise groups
        *   connect core points into clusters, associate with boundary points
        *   scikit-learn
            *   params
                *   eps: neighbourhood radius
                *   min_samples: min neighbours for a point to count as a core point
            *   can use any distance function you want
        *   also in GraphLab (Dato)
    *   applications: document deduplication
        *   creates dictionary with word count
        *   use Jaccard distance, not Euclidean
*   Use level set trees
    *   how to choose density level in DBSCAN?
    *   want to change density level $\implies$ start from scratch
    *   create tree once, pick your density later by just querying the tree
    *   visualisation of high dim data
    *   how to build?
        *   estimate density at each point
        *   construct similarity graph on the data: vertices are points, edges connect near neighbours (kNN)
        *   keep track of the connected components as the graph shrinks
        *   remove lowest density points and their edges until you get a split
        *   identify outliers by taking the lowest 5% (density) points
        *   shine with complex data (e.g. hurricane tracks)
        *   DeBaCl
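
To make the DBSCAN parameters above concrete, here is a minimal sketch using scikit-learn's `DBSCAN`. The toy data (two dense grids plus one isolated point) and the parameter values are assumptions chosen for illustration, not values from the talk.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (assumed): two dense 5x5 grids with spacing 0.1, plus one isolated point.
grid = np.array([[i * 0.1, j * 0.1] for i in range(5) for j in range(5)])
X = np.vstack([grid, grid + 5.0, [[10.0, -10.0]]])

# eps: neighbourhood radius.
# min_samples: neighbours (the point itself included) needed for a core point.
# metric: any supported distance can be swapped in here.
model = DBSCAN(eps=0.15, min_samples=4, metric="euclidean").fit(X)

labels = model.labels_                      # cluster index per point; -1 marks noise
n_clusters = len(set(labels) - {-1})        # number of clusters found
n_noise = int(np.sum(labels == -1))         # number of points flagged as outliers
print(n_clusters, n_noise)
```

Note that `k` is never specified: the number of clusters falls out of `eps` and `min_samples`, and the isolated point is labelled as noise rather than forced into a cluster.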


## Data Workflows

* reproducibility
    * spectrum: publication only $\leftrightarrow$ full replication
    * transparency
    * don't want to rewrite code
    * productising insights
* workflows
    * define separate process
    * high level abstraction of the underlying processes
    * types
        * environment/purpose specific **pipelines**
            * e.g. scikit-learn's `Pipeline`; a little like chaining callbacks
            * transparency of model
            * easily rework pipeline; configurable building blocks
        * cross-environment processing pipelines
            * luigi
                * requires output file at each step
                * each task is a class with
                    * params
                    * dependencies
                    * do stuff
                    * output
            * airflow
                * explicit dependencies
                * scheduling
        * REPL notebook environments (e.g. Jupyter)
            * good for small data
* workflow engines
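
As an illustration of the "configurable building blocks" idea, here is a minimal sketch of a scikit-learn `Pipeline`. The step names (`scale`, `classify`), the choice of estimators, and the toy data are assumptions made for this example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy, clearly separable data (assumed for illustration).
X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Each step is a named building block; the whole chain behaves like one estimator.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])
pipeline.fit(X, y)
predictions = pipeline.predict([[1.5], [11.5]])
print(predictions)

# Reworking the pipeline: parameters are addressed as <step name>__<parameter>.
pipeline.set_params(classify__C=10.0)
```

Because every step is named, the full model configuration can be inspected with `pipeline.get_params()`, which helps with the transparency point above.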


## Understanding Probabilistic Topic Models By Simulation

*   how to cluster data?
*   k-means?
*   Gaussian mixture models?
*   forward sampling
*   Bayesian style?
    *   model mixture proportions with a Dirichlet distribution (prior over the multinomial parameters)
*   generative process for distributions
*   given data
    *   find cluster for each point
    *   find params of each cluster
    *   find mixture proportion
*   discrete mixture model
    *   e.g. flip 3 sided weighted coin
*   topics as distributions over words
*   reverse sampling
    *   given collection of documents
        *   find topic for each document
        *   find distribution over topics
        *   find distribution over terms for each topic
    *   cons
        *   assuming word order irrelevant
        *   single topic per document
        *   assume number of topics known
        *   assumes topics are uncorrelated
*   LDA
    *   can tell when a word with two meanings might belong to different topics
    *   HDPLDA (hierarchical Dirichlet process LDA)
    *   read Blei's LDA paper (available on ACM website)
    *   pyLDAvis
    *   paper by Griffiths and Steyvers
    *   Applications
        *   LDA for genetic sequences
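
The forward-sampling idea can be sketched with the standard library alone. The block below is a simplified LDA-style generative process, not a full implementation: the two topics, their word probabilities, and the Dirichlet parameter `alpha` are all hypothetical values picked for illustration.

```python
import random

# Hypothetical topics: each is a distribution over words. "match" appears in
# both, so the same word can be generated by different topics.
TOPICS = [
    {"goal": 0.5, "match": 0.4, "vote": 0.1},
    {"vote": 0.6, "party": 0.3, "match": 0.1},
]

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Forward-sample one document: topic proportions first, then a topic
    and a word for every position."""
    theta = sample_dirichlet(alpha, rng)          # per-document topic proportions
    assignments, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(len(topics)), weights=theta)[0]   # pick a topic
        topic = topics[z]
        word = rng.choices(list(topic), weights=list(topic.values()))[0]
        assignments.append(z)
        words.append(word)
    return theta, assignments, words

rng = random.Random(0)
theta, assignments, words = generate_document(
    TOPICS, alpha=[0.5, 0.5], n_words=20, rng=rng
)
print(theta)
print(list(zip(assignments, words)))
```

Inference ("reverse sampling") runs this process backwards: given only `words`, estimate `theta`, the per-word topic assignments, and the topic-word distributions.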

## Parallelising Python

Multiprocessing in Python

joblib, multiprocessing

*   easy to parallelise
    *   cross-validation
    *   grid search
    *   random forest
    *   kernel density
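
Tasks like these are easy to parallelise because each evaluation is independent of the others. A minimal sketch with joblib follows; the `score` function is a hypothetical stand-in for fitting and scoring one model configuration, and the grid values are assumptions.

```python
from itertools import product

from joblib import Parallel, delayed

def score(eps, min_samples):
    """Hypothetical stand-in for fitting and scoring one configuration."""
    return -(eps - 0.3) ** 2 - 0.01 * (min_samples - 5) ** 2

# Every combination in the grid can be evaluated independently.
grid = list(product([0.1, 0.3, 0.5], [3, 5, 10]))

# Run the evaluations on two worker processes; results keep the input order.
scores = Parallel(n_jobs=2)(delayed(score)(eps, m) for eps, m in grid)

best_params = grid[scores.index(max(scores))]
print(best_params)
```

Setting `n_jobs=-1` asks joblib to use all available cores.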

GPU accelerated

*   PyCUDA

Neural net: Lasagne