# Chapter 1: Getting Started

* 싸이그래머 (Psygrammer): Scala ML
* 김무성

# Contents

* Mathematical notation for the curious
* Why machine learning?
* Why Scala?
* Model categorization
* Taxonomy of machine learning algorithms
* Tools and frameworks
* Source code
* Let's kick the tires
* Summary

### A common data mining workflow consists of the following sequential steps:

1. Loading the data.
2. Preprocessing, analyzing, and filtering the input data.
3. Discovering patterns, affinities, clusters, and classes.
4. Selecting the model features and the appropriate machine learning algorithm(s).
5. Refining and validating the model.
6. Improving the computational performance of the implementation.


# Mathematical notation for the curious

<img src="figures/eq1.1.png" />

# Why machine learning?

* Classification
* Prediction
* Optimization
* Regression

<font color="red">The explosion in the number of digital devices generates an ever-increasing amount of data.</font> 

## Classification

* The purpose of classification is to extract knowledge from historical data. 
    - For instance, a classifier can be built to identify a disease from a set of symptoms. 
        - body temperature (continuous variable),
        - congestion (discrete variables HIGH, MEDIUM, and LOW),
        - the actual diagnosis (flu). 
        - model 
            - IF temperature > 102 AND congestion = HIGH THEN patient has the flu (probability 0.72)
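
The rule above can be written as a small function. This is only a sketch: the names `FluRule`, `Congestion`, `Symptoms`, and `hasFlu` are hypothetical, and the threshold and probability are copied from the rule rather than learned from data.

```scala
object FluRule {
  // Discrete variable with the three levels named in the text
  sealed trait Congestion
  case object HIGH extends Congestion
  case object MEDIUM extends Congestion
  case object LOW extends Congestion

  // One observation: a continuous and a discrete variable
  case class Symptoms(temperature: Double, congestion: Congestion)

  // Returns the probability attached to the rule when it fires, None otherwise
  def hasFlu(s: Symptoms): Option[Double] =
    if (s.temperature > 102.0 && s.congestion == HIGH) Some(0.72) else None
}
```

For example, `FluRule.hasFlu(FluRule.Symptoms(103.1, FluRule.HIGH))` evaluates to `Some(0.72)`, while a patient with LOW congestion yields `None`.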

## Prediction

* Once the model is extracted and validated against past data, it can be used to draw inferences from future data. 
    - input : body temperature and nasal congestion, 
    - output : the state of the patient's health.

## Optimization

* Some global optimization problems are intractable using traditional linear and non-linear optimization methods. 
* Machine learning techniques improve the chances that the optimization method converges toward a solution (intelligent search).

## Regression

* Regression is a classification technique that is particularly suitable for a continuous model. 
    - Linear (least squares), polynomial, and logistic regressions are among the most commonly used techniques to fit a parametric model, or function, y = f(x_j), to a dataset.
* Regression is sometimes regarded as a specialized case of classification for which the output variables are continuous instead of categorical.
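
As an illustration of fitting a parametric function to a dataset, the following sketch fits a straight line y = a + b*x by ordinary least squares using the closed-form solution. The names `LeastSquares` and `fit` are hypothetical and do not come from the book.

```scala
object LeastSquares {
  // Fits y = a + b*x by ordinary least squares and returns (a, b)
  def fit(xs: Array[Double], ys: Array[Double]): (Double, Double) = {
    require(xs.length == ys.length && xs.length > 1, "need at least two paired points")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Covariance and variance terms (the 1/n factors cancel in the ratio)
    val sxy = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val sxx = xs.map(x => (x - meanX) * (x - meanX)).sum
    require(sxx > 0.0, "x values must not all be equal")
    val slope = sxy / sxx
    (meanY - slope * meanX, slope)
  }
}
```

Points lying exactly on y = 1 + 2x, such as (0, 1), (1, 3), (2, 5), (3, 7), give back an intercept of 1 and a slope of 2.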

# Why Scala?

* Abstraction
* Scalability
* Configurability
* Maintainability
* Computation on demand

## Abstraction

Monoids and monads are important concepts in functional programming.

#### Monoid

* A monoid combines a set of values with an associative binary operation and an identity element. For instance, if ts1, ts2, and ts3 are three time series, then the property ts1 + (ts2 + ts3) = (ts1 + ts2) + ts3 is true. 
* The associativity of a monoid operator is critical for the parallelization of computational workflows.

In [None]:
trait Monoid[T] {
     def zero: T
     def op(a: T, b: T): T
}
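
A possible instance of this trait for time series is sketched below. It assumes series of equal length added element by element, with the empty series acting as the identity. `MonoidSketch`, `tsAdd`, and `combineAll` are hypothetical names.

```scala
object MonoidSketch {
  trait Monoid[T] {
    def zero: T
    def op(a: T, b: T): T
  }

  // Assumed instance: element-wise addition of equal-length series.
  // The empty series is the identity element.
  val tsAdd: Monoid[Vector[Double]] = new Monoid[Vector[Double]] {
    def zero: Vector[Double] = Vector.empty
    def op(a: Vector[Double], b: Vector[Double]): Vector[Double] =
      if (a.isEmpty) b
      else if (b.isEmpty) a
      else a.zip(b).map { case (x, y) => x + y }
  }

  // Because op is associative (up to floating-point rounding), the fold
  // could be split across partitions and the partial results combined.
  def combineAll[T](xs: Seq[T], m: Monoid[T]): T = xs.foldLeft(m.zero)(m.op)
}
```

Associativity is what allows a runtime to group the operands freely, for example by reducing each partition of a dataset separately before merging the partial results.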

#### Monad

* Monads are structures that can be seen either as containers by programmers or as a generalization of Monoids.
* Monads provide the ability for those collections to perform the following functions:
    1. Create the collection.
    2. Transform the elements of the collection.
    3. Flatten nested collections.
* Monads allow those collections or containers to be chained to generate a workflow.

In [None]:
trait Monad[M[_]] {
     def apply[T](a: T): M[T]
     def flatMap[T, U](m: M[T])(f: T=>M[U]): M[U]
}
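
To make the three capabilities concrete, here is a sketch of an instance of this trait for `List`. The derived `map` method is an addition for illustration and is not part of the original trait. `MonadSketch` and `listMonad` are hypothetical names.

```scala
object MonadSketch {
  trait Monad[M[_]] {
    def apply[T](a: T): M[T]                            // create the collection
    def flatMap[T, U](m: M[T])(f: T => M[U]): M[U]      // transform and flatten
    // map can be derived from apply and flatMap
    def map[T, U](m: M[T])(f: T => U): M[U] = flatMap(m)(t => apply(f(t)))
  }

  // Instance for List, delegating to the standard library
  val listMonad: Monad[List] = new Monad[List] {
    def apply[T](a: T): List[T] = List(a)
    def flatMap[T, U](m: List[T])(f: T => List[U]): List[U] = m.flatMap(f)
  }
}
```

With this instance, `listMonad.flatMap(List(1, 2))(x => List(x, x * 10))` evaluates to `List(1, 10, 2, 20)`: each element is transformed into a list and the nested lists are flattened.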

## Scalability

#### Actors

* Actors are the core elements that make Scala scalable. 
    - Actors act as coroutines, managing the underlying thread pool. 
    - Actors communicate through passing asynchronous messages.
* In a nutshell, a workflow is implemented 
    - as a sequence of activities or computational tasks. 
    - Those tasks consist of higher-order Scala methods such as 
        - flatMap, 
        - map, 
        - fold, 
        - reduce, 
        - collect, 
        - join, or 
        - filter applied to a large collection of observations. 
    - Scala allows these observations to be partitioned by executing those tasks through a cluster of actors. 
* Scala also supports message dispatching and routing of messages between local and remote actors. 
* The engineers can decide to execute a workflow either locally or distributed across CPU cores and servers with no code or very little code changes.
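
The idea of partitioning observations and running the same higher-order methods on each partition can be sketched with the standard library alone. Note that this example uses `scala.concurrent.Future` instead of actors, which require an actor library such as Akka. `PartitionedWorkflow` and `sumOfSquaresOfPositives` are hypothetical names.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PartitionedWorkflow {
  // Splits the observations into partitions, processes each partition in its
  // own Future (filter, then map, then sum), and adds up the partial sums.
  def sumOfSquaresOfPositives(obs: Vector[Double], partitionSize: Int): Double = {
    require(partitionSize > 0, "partitionSize must be positive")
    val partials: Vector[Future[Double]] =
      obs.grouped(partitionSize).map { part =>
        Future(part.filter(_ > 0.0).map(x => x * x).sum)
      }.toVector
    // Wait for all partitions, then combine their results
    val total: Future[Double] = Future.sequence(partials).map(_.sum)
    Await.result(total, 10.seconds)
  }
}
```

The per-partition logic is ordinary collection code, so the same function body would work whether the partitions are processed sequentially or concurrently.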

<img src="figures/fig1.1.png" width=600 />

## Configurability

* Scala supports <font color="red">dependency injection</font> using 
    - a combination of abstract variables, 
    - self-referenced composition, and 
    - stackable traits. 
* One of the most commonly used dependency injection patterns, the <font color="red">cake pattern</font>, is used throughout this book to create dynamic computation workflows and plots.
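
A minimal sketch of the cake pattern, showing the three ingredients listed above: an abstract value, a self-type, and traits stacked at the wiring site. All names (`CakeSketch`, `PreprocessorComponent`, `Workflow`, `PositiveOnlyWorkflow`) are hypothetical and are not taken from the book's source code.

```scala
object CakeSketch {
  // Component exposing its dependency as an abstract value
  trait PreprocessorComponent {
    val preprocessor: Preprocessor
    trait Preprocessor { def apply(xs: Vector[Double]): Vector[Double] }
  }

  // Self-type: Workflow can only be mixed into a PreprocessorComponent,
  // so it may use `preprocessor` without knowing its implementation
  trait Workflow { self: PreprocessorComponent =>
    def run(xs: Vector[Double]): Double = preprocessor(xs).sum
  }

  // The concrete dependency is chosen where the traits are stacked together
  object PositiveOnlyWorkflow extends Workflow with PreprocessorComponent {
    val preprocessor: Preprocessor = new Preprocessor {
      def apply(xs: Vector[Double]): Vector[Double] = xs.filter(_ > 0.0)
    }
  }
}
```

Swapping the preprocessing step only requires defining another object that supplies a different `preprocessor`; the `Workflow` trait itself is unchanged.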

## Maintainability

* Scala embeds <font color="red">Domain Specific Languages (DSL)</font> natively. 
* DSLs are syntactic layers built on top of Scala native libraries. 
* DSLs allow software developers to abstract computation in terms that are easily understood by scientists.

## Computation on demand

* <font color="red">Lazy</font> methods and values allow developers to execute functions and allocate computing resources on demand. 
* The Spark framework relies on lazy variables and methods to chain <font color="red">Resilient Distributed Datasets (RDD)</font>.
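
The following sketch shows what on-demand evaluation means for a `lazy val`: the body runs on first access and the result is cached afterwards. `LazySketch`, `evaluations`, and `expensive` are hypothetical names; the counter exists only to make the behavior observable.

```scala
object LazySketch {
  var evaluations = 0  // counts how many times the expensive block runs

  // Not computed when LazySketch is initialized, only on first access
  lazy val expensive: Double = {
    evaluations += 1
    (1 to 1000).map(_.toDouble).sum
  }
}
```

Reading `LazySketch.expensive` twice increments `evaluations` only once, because the second read returns the cached value.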

# Model categorization

* A model can be predictive, descriptive, or adaptive.
* Predictive models
    - discover patterns in historical data and extract fundamental trends and relationships between factors. 
    - They are used to predict and classify future events or observations. Predictive analytics is used in a variety of fields such as marketing, insurance, and pharmaceuticals. 
    - Predictive models are created through supervised learning using a preselected training set.
* Descriptive models 
    - attempt to find unusual patterns or affinities in data by grouping observations into clusters with similar properties. 
    - These models define the first level in knowledge discovery. 
    - They are generated through unsupervised learning.
* Adaptive models
    - are generated through reinforcement learning. 
    - Reinforcement learning consists of one or several 
        - decision-making agents that 
            - recommend and 
            - possibly execute actions 
            - in an attempt to solve a problem, 
            - optimize an objective function, or
            - resolve constraints.

# Taxonomy of machine learning algorithms

* Unsupervised learning 
    - Clustering
    - Dimension reduction
* Supervised learning
    - Generative models
    - Discriminative models
* Reinforcement learning

* <font color="red">The purpose of machine learning is to teach computers to execute tasks without human intervention.</font>
* <font color="red">Ultimately, machine learning algorithms consist of identifying and validating models to optimize a performance criterion using historical, present, and future data.</font>

<img src="http://www.yger.net/wp-content/uploads/2013/02/learning.jpg" />

## Unsupervised learning

### Clustering

<img src="http://gerardnico.com/wiki/_media/data_mining/clustering.jpg" />

### Dimension reduction

<img src="http://www.mathworks.com/matlabcentral/fileexchange/screenshots/901/original.jpg" />

<img src="figures/fig1.2.png" width=600 />

## Supervised learning

<img src="http://www.yger.net/wp-content/uploads/2013/02/learning.jpg" />

<img src="http://sanghyukchun.github.io/images/post/61-1.jpg" />

<img src="http://image.slidesharecdn.com/textmining4-textclassification-florianleitner-140707121538-phpapp01/95/text-mining-45-text-classification-23-638.jpg?cb=1404735631" />

<img src="http://image.slidesharecdn.com/spatiallycoherentlatenttopicmodelforconcurrentobjectv1-3-091108054619-phpapp01/95/spatially-coherent-latent-topic-model-for-concurrent-object-segmentation-and-classification-5-728.jpg?cb=1257665696" />

<img src="http://www.evolvingai.org/sites/fish34.cs.uwyo.edu.lab/files/discriminative_vs_generative.png" />

<img src="https://syhw.files.wordpress.com/2010/06/discriminative_vs_generative-scaled1000.png" />

### Generative models

* Generative models attempt to fit a joint probability distribution, p(X,Y), of two events (or random variables), X and Y, representing two sets of observed and hidden (latent) variables x and y. 

<img src="figures/eq1.2.png" width=400 />

### Discriminative models

* Contrary to generative models, discriminative models compute the conditional probability p(Y|X) directly, using the same algorithm for training and classification.
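
The two families are connected by the standard identity below. A generative model that has fitted the joint distribution p(X, Y) can recover the conditional probability by normalizing over X, whereas a discriminative model estimates the left-hand side directly. This is the usual probability identity, stated here for reference and not copied from the book's figures.

$$
p(Y \mid X) = \frac{p(X, Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{p(X)}
$$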

#### Generative and discriminative models have their respective advantages and drawbacks.

<img src="figures/tbl1.1-1.png" width=600 />
<img src="figures/tbl1.1-2.png" width=600 />

<img src="figures/fig1.3.png" width=600 />

## Reinforcement learning

<img src="http://www.yger.net/wp-content/uploads/2013/02/learning.jpg" />

<img src="http://webdocs.cs.ualberta.ca/~sutton/book/ebook/figtmp7.png" />

<img src="http://www.cse.unsw.edu.au/~cs9417ml/RL1/images/overview.gif" />

<img src="http://www.pupin.rs/RnDProfile/images/research-topic02-01.jpg" />

* <font color="red">The foremost challenge developers of reinforcement learning systems face is that the recommended action or policy may depend on partially observable states and how to deal with uncertainty.</font>

<img src="figures/fig1.4.png" width=600 />

# Tools and frameworks

* Java
* Scala
* Apache Commons Math
* JFreeChart

## Java

## Scala

## Apache Commons Math

## JFreeChart

# Source code

https://www.packtpub.com/big-data-and-business-intelligence/scala-machine-learning

* Context versus view bounds
* Presentation
* Primitives and implicits
    - Primitive types
    - Type conversions
    - Operators
* Immutability
* Performance of Scala iterators

## Context versus view bounds

In [1]:
class MyClassInt[T <: Int]
class MyClassFloat[T <: Double]







In [None]:
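// A view bound requires an implicit conversion T => Double to be in scope.
class MyClassFloat[T <% Double]
// The declaration above is shorthand for:
// class MyClassFloat[T](implicit t2Double: T => Double)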
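
For comparison, a context bound `[T: Numeric]` asks for an implicit *instance* of a type class instead of an implicit conversion. The sketch below uses the standard library's `Numeric`. `BoundsSketch`, `total`, and `mean` are hypothetical names.

```scala
object BoundsSketch {
  // Context bound: requires an implicit Numeric[T] instance to be in scope
  def total[T: Numeric](xs: Seq[T]): T = {
    val num = implicitly[Numeric[T]]
    xs.foldLeft(num.zero)(num.plus)
  }

  // The same requirement written as an explicit implicit parameter
  def mean[T](xs: Seq[T])(implicit num: Numeric[T]): Double =
    num.toDouble(xs.foldLeft(num.zero)(num.plus)) / xs.size
}
```

Both functions work for `Int`, `Double`, and any other type with a `Numeric` instance, without converting the elements themselves until a `Double` is actually needed.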

## Presentation

## Primitives and implicits

### Primitive types

In [2]:
type XY = (Double, Double)
type XYTSeries = Array[(Double, Double)]
type DMatrix[T] = Array[Array[T]]
type DVector[T] = Array[T]
type DblMatrix = DMatrix[Double]
type DblVector = Array[Double]







### Type conversions

In [3]:
implicit def int2Double(n: Int): Double = n.toDouble
implicit def vectorT2DblVector[T <% Double](vt: DVector[T]): DblVector =
   vt.map( t => t.toDouble)
implicit def double2DblVector(x: Double): DblVector = Array[Double](x)
implicit def dblPair2DbLVector(x: (Double, Double)): DblVector =
   Array[Double](x._1,x._2)
implicit def dblPairs2DblRows(x: (Double, Double)): DblMatrix =
   Array[Array[Double]](Array[Double](x._1, x._2))







### Operators

In [4]:
def Op[T <% Double](v: DVector[T], w: DblVector, op: (T, Double) =>
   Double): DblVector =
      v.zipWithIndex.map(x => op(x._1, w(x._2)))





In [None]:
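// Divide every element of the vector by n
def /(v: DblVector, n: Int): DblVector = v.map(x => x / n)
// Divide every element of the given row by z, in place, and return the matrix
def /(m: DblMatrix, row: Int, z: Double): DblMatrix = {
    (0 until m(row).size).foreach(i => m(row)(i) /= z)
    m
}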

## Immutability

It is usually a good idea to reduce the number of states of an object.
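
One common way to keep the number of states low is to model data with immutable case classes and return modified copies instead of mutating fields. The sketch below illustrates this; `ImmutableSketch`, `Observation`, and `scaleVolume` are hypothetical names.

```scala
object ImmutableSketch {
  // All fields are vals: an instance keeps a single state for its whole lifetime
  case class Observation(volatility: Double, volume: Double)

  // "Modifying" an observation returns a new instance; the original is untouched
  def scaleVolume(obs: Observation, factor: Double): Observation =
    obs.copy(volume = obs.volume * factor)
}
```

Because an `Observation` never changes after construction, it can be shared between concurrent tasks without locking.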

## Performance of Scala iterators

# Let's kick the tires

* Overview of computational workflows
* Writing a simple workflow
    - Selecting a dataset
    - Loading the dataset
    - Preprocessing the dataset
    - Creating a model (learning)
    - Classify the data

## Overview of computational workflows

* A computational workflow <font color="red">to perform runtime processing of a dataset</font> is composed of the following stages:
    1. Loading the dataset from files, databases, or any streaming devices.
    2. Splitting the dataset for parallel data processing.
    3. Preprocessing data using filtering techniques, analysis of variance, and applying penalty and normalization functions whenever necessary.
    4. Applying the model, either a set of clusters or a set of classes, to classify new data.
    5. Assessing the quality of the model.
* A similar sequence of tasks is used <font color="red">to extract a model from a training dataset</font>:
    1. Loading the dataset from files, databases, or any streaming devices.
    2. Splitting the dataset for parallel data processing.
    3. Applying filtering techniques, analysis of variance, and penalty and normalization functions to the raw dataset whenever necessary.
    4. Selecting the training, testing, and validation set from the cleansed input data.
    5. Extracting key features, establishing affinity between a similar group of observations using clustering techniques or supervised learning algorithms.
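
The stages listed above can be pictured as functions chained with `andThen`. The sketch below is a simplified stand-in under stated assumptions: the stage names, the outlier threshold, and the 75/25 split ratio are all hypothetical, and the data is reduced to a vector of doubles.

```scala
object WorkflowSketch {
  type Data = Vector[Double]

  // Filtering stage: drop values with an implausibly large magnitude (assumed threshold)
  val filterOutliers: Data => Data = _.filter(x => math.abs(x) < 1e6)

  // Normalization stage: rescale to [0, 1]; constant input maps to zeros
  val normalize: Data => Data = xs =>
    if (xs.isEmpty) xs
    else {
      val (lo, hi) = (xs.min, xs.max)
      if (hi > lo) xs.map(x => (x - lo) / (hi - lo)) else xs.map(_ => 0.0)
    }

  // Selection stage: split into (training, validation) sets by a fixed ratio
  def split(ratio: Double): Data => (Data, Data) =
    xs => xs.splitAt((xs.size * ratio).toInt)

  // Stages are chained left to right with andThen
  val pipeline: Data => (Data, Data) =
    filterOutliers andThen normalize andThen split(0.75)
}
```

Each stage has the shape `input => output`, so stages can be reordered, replaced, or tested in isolation before being composed.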


<img src="figures/fig1.5.png" width=600 />

## Writing a simple workflow

* This book relies on <font color="red">financial data</font> to experiment with different learning strategies.
* The objective of the exercise is to build a model that can <font color="red">discriminate between volatile and nonvolatile trading sessions</font>. 
* For this first example, we select a simplified version of the <font color="red">logistic regression</font> as our classifier as we treat a <font color="red">stock-price-volume action</font> as a <font color="red">continuous or pseudo-continuous process</font>.
* The classification of trading sessions according to their volatility proceeds as follows:
    - Select a dataset
    - Load the dataset
    - Preprocess the dataset
    - Display data
    - Create the model through training
    - Classify new data

### Selecting a dataset

<img src="figures/fig1.6.png" width=600 />

### Loading the dataset

In [None]:
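import scala.io.Source

val CSV_DELIM = ","  // assumed delimiter; not defined in the original listing

def load(fileName: String): Option[XYTSeries] = {
     val src = Source.fromFile(fileName)
     // Split each line into its delimited fields
     val fields = src.getLines.map( _.split(CSV_DELIM)).toArray
     val cols = fields.drop(1)  // skip the header line
     val data = transform(cols)
     src.close  // release the file handle
     Some(data)
}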

### Preprocessing the dataset

* Basic statistics
* Normalization and Gauss distribution

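The next step is to normalize the data so that it can be used to train the logistic binary classifier. The stated target range is [-0.5, 0.5]; note that the `normalize` method shown in this section computes (x - min)/(max - min), which maps values into [0, 1].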

#### Basic statistics

In [None]:
val mean = price.reduceLeft( _ + _ )/price.size
val s2 = price.foldLeft(0.0)((s,x) =>s+(x-mean)*(x-mean))
val stdDev = Math.sqrt(s2/(price.size-1) )

In [None]:
class Stats[T <% Double](private val values: DVector[T]) {
   val ZERO_EPS = 1e-12  // assumed threshold; not defined in the original listing
   class _Stats(var minValue: Double, var maxValue: Double,
                var sum: Double, var sumSqr: Double)
   // Single pass over the values: accumulate min, max, sum, and sum of squares
   val stats = {
     val _stats = new _Stats(Double.MaxValue, Double.MinValue, 0.0, 0.0)
     values.foreach(x => {
        if(x < _stats.minValue) _stats.minValue = x
        if(x > _stats.maxValue) _stats.maxValue = x
        _stats.sum += x
        _stats.sumSqr += x*x
     })
     _stats 
   }
    
   // Derived statistics are computed on first access only
   lazy val mean = stats.sum/values.size
   lazy val variance = (stats.sumSqr - mean*mean*values.size)/(values.size-1)
   lazy val stdDev = if(variance < ZERO_EPS) ZERO_EPS else Math.sqrt(variance)
   lazy val min = stats.minValue
   lazy val max = stats.maxValue
}

#### Normalization and Gauss distribution

In [None]:
def normalize: DblVector = {
     val range = max - min
     values.map(x => (x - min)/range)
}

In [None]:
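// Gaussian density of each value; INV_SQRT_2PI stands for 1/sqrt(2*Pi)
def gauss: DblVector =
      values.map(x =>{
         val y = x - mean
         // the exponent is divided by the variance, i.e. stdDev squared
         INV_SQRT_2PI/stdDev*Math.exp(-0.5*y*y/(stdDev*stdDev))})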

<img src="figures/fig1.7.png" width=600 />

In [None]:
object YahooFinancials extends Enumeration {
    type YahooFinancials = Value
    val DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, ADJ_CLOSE = Value
    val volatility = (fs: Array[String]) => fs(HIGH.id).toDouble - fs(LOW.id).toDouble
... 
}

In [None]:
def transform(cols: Array[Array[String]]): XYTSeries = {
     // Normalize the session volatility and the session volume, then pair them up
     val volatility = new Stats[Double](cols.map(YahooFinancials.volatility)).normalize
     val volume = new Stats[Double](cols.map(YahooFinancials.volume)).normalize
     volatility.zip(volume)
}

### Plotting data

In [None]:
val xLegend = "Session Volatility"
val yLegend = "Session Volume"
def display(xy: XYTSeries, w: Int, h : Int): Unit = {
        val series = new XYSeries("CSCO 2012-2013 Stock")
        xy.foreach( x => series.add( x._1,x._2))
        val seriesCollection = new XYSeriesCollection
        seriesCollection.addSeries(series)
       ... // plot rendering code
        val chart = ChartFactory.createScatterPlot(xLegend, xLegend,
           yLegend, seriesCollection, PlotOrientation.VERTICAL, true, false,
           false)
        createFrame("Logistic Regression", chart)
}

In [None]:
val plot = new ScatterPlot(("CSCO 2012-2013", "Session High - Low",
"Session Volume"), new BlackPlotTheme)
plot.display(volatilityVol.filter( _._1 < 0.5), 250, 340)

<img src="figures/fig1.8.png" width=600 />

### Creating a model (learning)

In [None]:
import scala.util.Random

class LogBinRegression(val labels: DVector[(XY, Double)], val maxIters: Int, val eta: Double, val eps: Double) {
     val dim = 3
     val weights = train
     // Returns (predicted class, likelihood), or None if no weights are available
     def classify(xy: XY): Option[(Boolean, Double)] = weights match {
        case Some(w) => {
           val likelihood = sigmoid(w(0) + xy._1*w(1) + xy._2*w(2))
           Some((likelihood > 0.5, likelihood))
        }
        case None => None
     }
     def train: Option[DblVector] = {
        // Random initial weights in [-1, 0)
        val w = Array.fill(dim)(Random.nextDouble-1.0)
        Range(0, maxIters).find(_ => {
            // Accumulate the simplified gradient over all labeled observations
            val deltaW = labels.foldLeft(Array.fill(dim)(0.0))((dw, lbl) => {
              val y = sigmoid(w(0) + w(1)*lbl._1._1 +  w(2)*lbl._1._2)
              dw.map(dx => dx + (lbl._2 - y)*(lbl._1._1 + lbl._1._2))
            })
            val nextW = Array.fill(dim)(0.0)
                        .zipWithIndex
                        .map(nw => w(nw._2)+eta*deltaW(nw._2))
            val diff = Math.abs(nextW.sum - w.sum)
            // Update the weights and stop once their sum changes by less than eps
            nextW.copyToArray(w); diff < eps 
        }) match {
           case Some(iters) => Some(w)
           case None => { ... }
        }
     }
    
     def sigmoid(x: Double):Double = 1.0/(1.0 + Math.exp(-x))
}

In [None]:
val labels = volatilityVol.zip(volatilityVol.map(x =>if( x._1>0.3 &&
   x._2>0.3) 1.0 else 0.0))

In [None]:
val logit = new LogBinRegression(labels, 300, 0.00005, 0.02)

### Classify the data

In [None]:
Date,Open,High,Low,Close,Volume,Adj Close
3/9/2011,14.78,15.08,14.20,14.91,4.79E+08,14.88
11/17/2009,10.78,10.90,10.62,10.84,3901987,10.85

In [None]:
// load returns an Option; it always yields Some(...) in the listing above, so get is safe here
val testData = load("resources/data/chap1/CSCO2.csv").get
logit.classify(testData(0)) match {
     case Some(topCategory) => Display.show(topCategory)
     case None => { ... }
}
logit.classify(testData(1)) match {
     case Some(topCategory) => Display.show(topCategory)
     case None => { ... }
}

# Summary

* From monadic composition and higher-order collection methods for parallelization to configurability and reusability patterns, Scala is well suited to implementing data mining and machine learning algorithms for large-scale projects.
* Creating and applying a machine learning model involves many steps, from loading and preprocessing the data to training the model and classifying new observations.
* The logistic binary classifier implemented in the test case is simple enough to encourage you to learn how to write and apply more advanced machine learning algorithms.

# References