# Prologue

__Keywords covered__ 
* Applying the concept of monadic design
* Scala's advanced functional features: e.g. dependency injection
* The bias-variance trade-off	
* Overfitting					 			
* Training, test, and validation sets	
* Precision, recall, and F score 	

# Modeling

* A model is composed of features (also known as attributes or variables) and a set of relations between those features, for example:
        f(x, y) = x.sin(2y)
        f(x, y) < 20
* Formally, a model can be viewed as a monoid for which the set is a group of observations and the operator is the function implementing the model.


__[Shapes and forms]__	 							
* Parametric: This consists of functions and equations 
(for example, y = sin(2t + w))	
* Differential: This consists of ordinary and partial differential equations
(for example, dy = 2x.dx)	
* Probabilistic: This consists of probability distributions 
(for example, p (x|c) = exp (k.logx – x)/x!)
* Graphical: This consists of graphs that abstract out the conditional independence between variables 
(for example, p(x,y|c) = p(x|c).p(y|c))
* Directed graphs: This consists of temporal and spatial relationships 
(for example, a scheduler)
* Numerical method: This consists of finite elements and methods such as Newton-Raphson	
* Chemistry: This consists of formulas and components (for example, H2O, Fe + Cl2 = FeCl3, and so on)
* Taxonomy: This consists of a semantic definition and relationship of concepts 
(for example, APG/Eudicots/Rosids/Huaceae/Malvales)
* Grammar and lexicon: This consists of a syntactic representation of documents 
(for example, Scala programming language)
* Inference logic: This consists of a distribution pattern such as IF (stock vol > 1.5 * average) AND rsi > 80 THEN... 	

## Model versus design

* Modeling: This is describing something you know. A model makes an assumption, which becomes an assertion if proven correct.
* Designing: This is manipulating the representation of things you don't know. Designing can be seen as the exploration phase of modeling.


## Selecting a model's features

Feature selection consists of two steps:

1. Searching for new feature subsets.
2. Evaluating these feature subsets using a scoring mechanism.
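
The two steps can be sketched as an exhaustive search, assuming the feature set is small enough to enumerate. The names `candidates`, `select`, `toyScore`, and the `Score` type are hypothetical and only serve the illustration; practical feature selection usually relies on greedy or heuristic search rather than full enumeration.

```scala
object FeatureSelectionSketch {
  // Hypothetical scoring mechanism: maps a feature subset to a score (higher is better).
  type Score = Set[String] => Double

  // Step 1: search, here by enumerating every non-empty subset.
  // This grows as 2^n, so it is only practical for a handful of features.
  def candidates(features: Set[String]): Iterator[Set[String]] =
    features.subsets().filter(_.nonEmpty)

  // Step 2: evaluate each candidate and keep the best-scoring one.
  // Requires at least one feature, otherwise maxBy fails on an empty iterator.
  def select(features: Set[String], score: Score): Set[String] =
    candidates(features).maxBy(score)

  // Toy score (an assumption): rewards the features "x" and "y", penalizes subset size.
  val toyScore: Score = s => s.count(f => f == "x" || f == "y") - 0.1 * s.size

  val best: Set[String] = select(Set("x", "y", "z"), toyScore)
}
```

With this toy score, the subset containing only `x` and `y` wins because adding `z` costs a size penalty without any reward.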

## Extracting features

* Purpose: to reduce the number of variables or dimensions of the model by eliminating redundant or irrelevant features.
* Method: transforming the original set of observations into a smaller set, at the risk of losing some vital information embedded in the original set.


# Designing a workflow

<img src="Screen Shot 2015-06-27 at 4.46.48 AM.png">
* Scala's higher-order functions are particularly suitable for implementing configurable data transformations. 		

# The computational framework

<img src="Screen Shot 2015-06-27 at 4.50.37 AM.png">
* The goal is to define a trait and a method that describe the transformation of data by a computation unit (an element of the workflow).

# The pipe operator

* The goal is to define a symbolic representation of the transformation of different types of data without exposing the internal state of the algorithm implementing the data transformation.

In [None]:
trait PipeOperator[-T, +U] {
  def |> (data: T): Option[U]
}

* The |> operator transforms data of type T into data of type U and returns an Option to handle internal errors and exceptions.
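
As a minimal sketch of how the trait might be used, the following defines two transformations and chains them. `ParseDouble`, `SquareRoot`, and `parseThenSqrt` are hypothetical names introduced for this illustration; the only element taken from the notes is the `PipeOperator` trait itself.

```scala
import scala.util.Try

object PipeOperatorSketch {
  trait PipeOperator[-T, +U] {
    def |> (data: T): Option[U]
  }

  // Hypothetical transformation: parses a string into a number.
  // Try(...).toOption converts a parsing exception into None.
  object ParseDouble extends PipeOperator[String, Double] {
    override def |> (data: String): Option[Double] = Try(data.trim.toDouble).toOption
  }

  // Hypothetical transformation: square root, rejected for negative input.
  object SquareRoot extends PipeOperator[Double, Double] {
    override def |> (data: Double): Option[Double] =
      if (data >= 0.0) Some(Math.sqrt(data)) else None
  }

  // Chaining: the second operator runs only if the first one succeeded.
  def parseThenSqrt(s: String): Option[Double] =
    (ParseDouble |> s).flatMap(SquareRoot |> _)
}
```

Because each stage returns an `Option`, a failure in any stage propagates as `None` without the caller having to catch exceptions.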

## Monadic data transformation
* The goal is to create a monadic design that implements the pipe operator.

In [None]:
// Container wrapping a single value of type T (typically a function)
class _FCT[+T](val _fct: T) {
  def map[U](c: T => U): _FCT[U] = new _FCT[U](c(_fct))
  def flatMap[U](f: T => _FCT[U]): _FCT[U] = f(_fct)
  // zeroFCT is not defined in this excerpt
  def filter(p: T => Boolean): _FCT[T] =
    if (p(_fct)) new _FCT[T](_fct) else zeroFCT(_fct)
  def reduceLeft[U](f: (U, T) => U)(implicit c: T => U): U = f(c(_fct), _fct)
  // Folds the single wrapped value into zero; the implicit conversion c is unused here
  def foldLeft[U](zero: U)(f: (U, T) => U)(implicit c: T => U): U = f(zero, _fct)
  def foreach(p: T => Unit): Unit = p(_fct)
}

* The methods of the _FCT class represent a subset of the traditional Scala higher-order methods for collections.

In [None]:
class Transform[-T, +U](val op: PipeOperator[T, U])
    extends _FCT[Function[T, Option[U]]](op.|>) {
  def |>(data: T): Option[U] = _fct(data)
}

* Any algorithm can be created by just implementing the PipeOperator trait, so the Transform wrapper is not strictly required.
* The reason for using it is that Transform has a richer protocol (methods), which enables developers to create complex workflows as an alternative to delimited continuations.

* Example: a generic function composition (or data transformation composition) using the monadic approach

In [None]:
val op = new PipeOperator[Int, Double] {
  def |> (n: Int): Option[Double] = Some(Math.sin(n.toDouble))
}

// Converts a function returning Option[Double] into a function returning Long,
// with -1L signalling a failed transformation
def g(f: Int => Option[Double]): (Int => Long) = {
  (n: Int) => {
    f(n) match {
      case Some(x) => x.toLong
      case None => -1L
    }
  }
}

val gof = new Transform[Int, Double](op).map(g(_))
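
Defining `map` and `flatMap` is also what makes such a container usable in a `for` comprehension. The sketch below assumes a reduced version of `_FCT` with only those two methods; `sine`, `scale`, and `scaledSine` are hypothetical values chosen for the illustration.

```scala
object MonadicSketch {
  // Reduced single-value container: only map and flatMap are kept from the class above.
  class _FCT[+T](val _fct: T) {
    def map[U](c: T => U): _FCT[U] = new _FCT[U](c(_fct))
    def flatMap[U](f: T => _FCT[U]): _FCT[U] = f(_fct)
  }

  val sine: _FCT[Double => Double] = new _FCT((x: Double) => Math.sin(x))
  val scale: _FCT[Double => Double] = new _FCT((x: Double) => 2.0 * x)

  // Desugars to sine.flatMap(f => scale.map(g => f andThen g)):
  // the result wraps the composed function x => 2 * sin(x).
  val scaledSine: _FCT[Double => Double] = for {
    f <- sine
    g <- scale
  } yield f andThen g
}
```

The wrapped function is reached through `_fct`, for example `MonadicSketch.scaledSine._fct(0.0)`.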

# Dependency injection
* A workflow composed of configurable data transformations requires dynamic modularization (substitution) of its different stages.
* __Cake pattern__: an advanced class composition pattern that uses mix-in traits to meet the demands of a configurable computation workflow (also known as stackable modification traits)
* __Dependency injection__: a reverse lookup and binding to dependencies

* _For a classification use case:_

In [None]:
val myApp = new Classification with Validation with PreProcessing {
  val filter = ..
}


* _If an unsupervised clustering algorithm is needed at a later stage, the same composition has to be repeated, which leads to code duplication and a lack of flexibility:_

In [None]:
val myApp = new Clustering with Validation with PreProcessing {
  val filter = ..
}

* The solution is to use the self type in the definition of the newly composed PreProcessingWithValidation trait:

In [None]:
val myApp = new Classification with PreProcessingWithValidation {
  val validation: Validation
}
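
The notes do not show the definition of the composed trait, so the following is only a sketch of how a self type might be declared. The method names `validate`, `filter`, and `filterValidated`, the `NonEmptyValidation` trait, and the empty `Classification` and `Clustering` classes are assumptions made for the illustration.

```scala
object SelfTypeSketch {
  trait Validation { def validate(xs: Seq[Double]): Boolean }
  trait PreProcessing { def filter(xs: Seq[Double]): Seq[Double] }

  // The self type states that any class mixing in this trait must also be a Validation,
  // without PreProcessingWithValidation inheriting from Validation itself.
  trait PreProcessingWithValidation extends PreProcessing {
    self: Validation =>
    def filterValidated(xs: Seq[Double]): Option[Seq[Double]] = {
      val filtered = filter(xs)
      if (validate(filtered)) Some(filtered) else None
    }
  }

  // Hypothetical concrete validation, injected at composition time.
  trait NonEmptyValidation extends Validation {
    def validate(xs: Seq[Double]): Boolean = xs.nonEmpty
  }

  class Classification
  class Clustering

  // The same composed trait is reused for both algorithms.
  val classifier = new Classification with PreProcessingWithValidation with NonEmptyValidation {
    def filter(xs: Seq[Double]): Seq[Double] = xs.filter(_ >= 0.0)
  }
  val clusterer = new Clustering with PreProcessingWithValidation with NonEmptyValidation {
    def filter(xs: Seq[Double]): Seq[Double] = xs.filter(!_.isNaN)
  }
}
```

Omitting the `Validation` mix-in when instantiating either value would be rejected at compile time, which is the point of the self type.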

# Workflow modules

* The data transformation defined by the PipeOperator instance is dynamically injected into the module by initializing the abstract value.
* Three parameterized modules represent the preprocessing, processing, and post-processing stages of a workflow:

In [None]:
trait PreprocModule[-T, +U] { val preProc: PipeOperator[T, U] }
trait ProcModule[-T, +U] { val proc: PipeOperator[T, U] }
trait PostprocModule[-T, +U] { val postProc: PipeOperator[T, U] }


* One characteristic of the Cake pattern is to enforce strict modularity by initializing the abstract values with the type encapsulated in the module

In [None]:
trait ProcModule[-T, +U] {
  val proc: PipeOperator[T, U]
  // The implementation of |> is omitted in this excerpt
  class Classification[-T, +U] extends PipeOperator[T, U] { }
}

# The workflow factory

* The next step is to assemble the different modules into a workflow, using a self reference to the stack of the three traits defined in the previous section.

* You can either create a workflow with a stack of two traits or initialize the third with the PipeOperator identity:

In [None]:
def identity[T] = new PipeOperator[T, T] {
  override def |> (data: T): Option[T] = Some(data)
}
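
The `Workflow` class itself is not listed in these notes. One way it might be written, assuming its self type stacks the three module traits and that the stages are chained with `flatMap`, is sketched below. The example instance `flow` and its two anonymous operators are hypothetical; the third stage is filled with the identity operator.

```scala
import scala.util.Try

object WorkflowSketch {
  trait PipeOperator[-T, +U] { def |> (data: T): Option[U] }
  trait PreprocModule[-T, +U] { val preProc: PipeOperator[T, U] }
  trait ProcModule[-T, +U] { val proc: PipeOperator[T, U] }
  trait PostprocModule[-T, +U] { val postProc: PipeOperator[T, U] }

  def identity[T]: PipeOperator[T, T] = new PipeOperator[T, T] {
    override def |> (data: T): Option[T] = Some(data)
  }

  // Assumed shape of the workflow: the self type requires the three modules,
  // and a failure (None) in any stage short-circuits the remaining stages.
  class Workflow[T, U, V, W] {
    self: PreprocModule[T, U] with ProcModule[U, V] with PostprocModule[V, W] =>
    def |> (data: T): Option[W] =
      (preProc |> data).flatMap(proc |> _).flatMap(postProc |> _)
  }

  // Two real stages; the post-processing stage is the identity.
  val flow = new Workflow[String, Double, Double, Double]
      with PreprocModule[String, Double]
      with ProcModule[Double, Double]
      with PostprocModule[Double, Double] {
    val preProc: PipeOperator[String, Double] = new PipeOperator[String, Double] {
      override def |> (data: String): Option[Double] = Try(data.toDouble).toOption
    }
    val proc: PipeOperator[Double, Double] = new PipeOperator[Double, Double] {
      override def |> (data: Double): Option[Double] =
        if (data > 0.0) Some(Math.log(data)) else None
    }
    val postProc: PipeOperator[Double, Double] = identity[Double]
  }
}
```

The type parameters of adjacent modules have to line up (the output type of one stage is the input type of the next), so a mismatched stage is rejected by the compiler.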


In [None]:
// Samples a function f over [0, 1) at `samples` evenly spaced points
class Sampler(val samples: Int) extends PipeOperator[Double => Double, DblVector] {
  override def |> (f: Double => Double): Option[DblVector] =
    Some(Array.tabulate(samples)(n => f(n.toDouble/samples)))
}

// Normalizes the sampled values
class Normalizer extends PipeOperator[DblVector, DblVector] {
  override def |> (data: DblVector): Option[DblVector] =
    Some(Stats[Double](data).normalize)
}

// Returns the index of the first value equal to 1.0, if any
class Reducer extends PipeOperator[DblVector, Int] {
  override def |> (data: DblVector): Option[Int] =
    Range(0, data.size) find (data(_) == 1.0)
}

<img src ="Screen Shot 2015-06-27 at 11.55.56 AM.png">

In [None]:
val dataflow = new Workflow[Double => Double, DblVector, DblVector, Int]
    with PreprocModule[Double => Double, DblVector]
    with ProcModule[DblVector, DblVector]
    with PostprocModule[DblVector, Int] {
  val preProc: PipeOperator[Double => Double, DblVector] = new Sampler(100)
  val proc: PipeOperator[DblVector, DblVector] = new Normalizer
  val postProc: PipeOperator[DblVector, Int] = new Reducer
}

// Random refers to scala.util.Random
dataflow |> ((x: Double) => Math.log(x+1.0) + Random.nextDouble) match {
  case Some(index) => ...
}

* Scala's strong type checking catches any inconsistent data types at compilation time.
* This shortens the development cycle, because the same mistakes would otherwise surface as runtime errors, which are more difficult to track down.

# Examples of workflow components

## The preprocessing module

* The preprocessing module uses the time series class:

In [None]:
class XTSeries[T](label: String, arr: Array[T])

* The preprocessing algorithms such as moving average or discrete Fourier filters are encapsulated into a preprocessing module

In [None]:
trait PreprocessingModule[T] {
  // Abstract value initialized with the preprocessing algorithm to inject
  val preprocessor: Preprocessing[T]

  abstract class Preprocessing[T] {
    def execute(xt: XTSeries[T]): Unit
  }

  abstract class MovingAverage[T] extends Preprocessing[T]
      with PipeOperator[XTSeries[T], XTSeries[Double]] {
    override def execute(xt: XTSeries[T]): Unit = this |> xt match {
      case Some(filteredData) => ...
      case None => ...
    }
  }

  class SimpleMovingAverage[@specialized(Double) T <% Double](period: Int)(implicit num: Numeric[T])
      extends MovingAverage[T] {
    override def |> (xt: XTSeries[T]): Option[XTSeries[Double]] =
      ...
  }

  class DFTFir[T <% Double](g: Double => Double) extends Preprocessing[T]
      with PipeOperator[XTSeries[T], XTSeries[Double]] {
    override def execute(xt: XTSeries[T]): Unit = this |> xt match {
      case Some(filteredData) => ...
      case None => ...
    }
    override def |> (xt: XTSeries[T]): Option[XTSeries[Double]] = ...
  }
}

## The clustering module

* The clustering module is the second example of a module for a dependency-injection-based workflow:

In [None]:
trait ClusteringModule[T] {
  type EMOutput = List[(Double, DblVector, DblVector)]
  // Abstract value initialized with the clustering algorithm to inject
  val clustering: Clustering[T]

  abstract class Clustering[T] {
    def execute(xt: XTSeries[Array[T]]): Unit
  }

  class KMeans[T <% Double](K: Int, maxIters: Int, distance: (DblVector, Array[T]) => Double)
      (implicit order: Ordering[T], m: Manifest[T])
      extends Clustering[T] with PipeOperator[XTSeries[Array[T]], List[Cluster[T]]] {
    override def |> (xt: XTSeries[Array[T]]): Option[List[Cluster[T]]] = ...
    override def execute(xt: XTSeries[Array[T]]): Unit = this |> xt match {
      case Some(clusters) => ...
      case None => ...
    }
  }

  class MultivariateEM[T <% Double](K: Int) extends Clustering[T]
      with PipeOperator[XTSeries[Array[T]], EMOutput] {
    override def |> (xt: XTSeries[Array[T]]): Option[EMOutput] = ...
    override def execute(xt: XTSeries[Array[T]]): Unit = this |> xt match {
      case Some(emOutput) => ...
      case None => ...
    }
  }
}

# Assessing a model
* Evaluating a model is an essential part of the workflow. There is no point in creating the most sophisticated model if you do not have the tools to assess its quality. The validation process consists of defining some quantitative reliability criteria, setting a strategy such as an N-fold cross-validation scheme, and selecting the appropriate labeled data.

## Validation

* The purpose of this section is to create a Scala class to be used in future chapters for validating models. For starters, the validation process relies on a set of metrics to quantify the fitness of a model generated through training.


### Key metrics
<img src="Screen Shot 2015-06-27 at 4.59.32 AM.png">
<img src="Screen Shot 2015-06-27 at 5.00.12 AM.png">

### Implementation
* The validation formulas are implemented using the same trait-based modular design used in creating the preprocessor and classifier modules:

In [None]:
trait Validation {
  def f1: Double
  def precisionRecall: (Double, Double)
}

* actualExpected: the array of actual versus expected classes
* tpClass: the target class for true positive observations

In [None]:
class F1Validation(actualExpected: Array[(Int, Int)], tpClass: Int) extends Validation {
  // Tally of TP, TN, FP and FN labels over all (actual, expected) pairs
  val counts = actualExpected.foldLeft(new Counter[Label])((cnt, oSeries) =>
    cnt + classify(oSeries._1, oSeries._2))

  lazy val accuracy = {
    val num = counts(TP) + counts(TN)
    num.toDouble/counts.foldLeft(0)((s, kv) => s + kv._2)
  }
  lazy val precision = counts(TP).toDouble/(counts(TP) + counts(FP))
  lazy val recall = counts(TP).toDouble/(counts(TP) + counts(FN))

  override def f1: Double = 2.0*precision*recall/(precision + recall)
  override def precisionRecall: (Double, Double) = (precision, recall)

  def classify(actual: Int, expected: Int): Label = {
    if (actual == expected) { if (actual == tpClass) TP else TN }
    else { if (actual == tpClass) FP else FN }
  }
}

* The precision and recall variables are defined as lazy so they are computed only once, when they are either accessed for the first time or the f1 and precisionRecall functions are invoked.

In [None]:
object Label extends Enumeration {
  type Label = Value
  val TP, TN, FP, FN = Value
}
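
Since `Counter` is not defined in these notes, the same metrics can be sketched in a self-contained form. The `Counts` case class and the `count` function are hypothetical stand-ins, the classification rule mirrors `classify` above, and the sample data is made up for the illustration.

```scala
object F1Sketch {
  // Confusion counts for one target class.
  // The ratios are NaN when a denominator is zero (for example, no positive predictions).
  final case class Counts(tp: Int, tn: Int, fp: Int, fn: Int) {
    def precision: Double = tp.toDouble / (tp + fp)
    def recall: Double = tp.toDouble / (tp + fn)
    def f1: Double = 2.0 * precision * recall / (precision + recall)
    def accuracy: Double = (tp + tn).toDouble / (tp + tn + fp + fn)
  }

  // Same rule as classify: a match is TP or TN, a mismatch is FP or FN,
  // depending on whether the actual class is the target class.
  def count(actualExpected: Seq[(Int, Int)], tpClass: Int): Counts =
    actualExpected.foldLeft(Counts(0, 0, 0, 0)) { case (c, (actual, expected)) =>
      if (actual == expected) {
        if (actual == tpClass) c.copy(tp = c.tp + 1) else c.copy(tn = c.tn + 1)
      } else {
        if (actual == tpClass) c.copy(fp = c.fp + 1) else c.copy(fn = c.fn + 1)
      }
    }

  // Made-up sample: 2 TP, 1 TN, 1 FP, 1 FN for target class 1.
  val sample: Counts = count(Seq((1, 1), (1, 1), (1, 0), (0, 1), (0, 0)), tpClass = 1)
}
```

For this sample, precision and recall are both 2/3, so the F1 score is also 2/3, while the accuracy is 3/5.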

## K-fold cross-validation
<img src="Screen Shot 2015-06-27 at 5.06.07 AM.png">
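
A K-fold split can be sketched as follows, assuming the observations are assigned to folds by index in round-robin order and are not shuffled. The `kFold` function is a hypothetical helper, not part of the validation classes above.

```scala
object KFoldSketch {
  // Splits the observations into k (training, validation) pairs.
  // Each observation lands in exactly one validation set; the remaining
  // observations of that fold form the training set.
  def kFold[A](data: Vector[A], k: Int): Seq[(Vector[A], Vector[A])] = {
    require(k > 1 && k <= data.size, "k must be between 2 and the number of observations")
    (0 until k).map { fold =>
      val (validation, training) =
        data.zipWithIndex.partition { case (_, i) => i % k == fold }
      (training.map(_._1), validation.map(_._1))
    }
  }

  val folds = kFold(Vector(1, 2, 3, 4, 5, 6), 3)
}
```

A model would then be trained on each training set and scored on the matching validation set, with the k scores averaged. In practice the data is usually shuffled before splitting.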

## Bias-variance decomposition
<img src="Screen Shot 2015-06-27 at 5.04.40 AM.png">

In [None]:
class BiasVarianceEmulator[T <% Double](emul: Double => Double, nValues: Int) {
  def fit(fEst: List[Double => Double]): Option[XYTSeries] = {
    val rf = Range(0, fEst.size)
    // Mean of all the estimators at each point x
    val meanFEst = Array.tabulate(nValues)(x =>
      rf.foldLeft(0.0)((s, n) => s + fEst(n)(x))/fEst.size)
    val r = Range(0, nValues)
    // For each estimator, accumulate a (variance, bias) pair over all points
    Some(fEst.map(fe => {
      r.foldLeft((0.0, 0.0))((s, x) => {
        val diff = (fe(x) - meanFEst(x))/fEst.size
        (s._1 + diff*diff, s._2 + Math.abs(fe(x) - emul(x)))
      })
    }).toArray)
  }
}

<img src="formula.png">

In [None]:
val emul = (x: Double) => 0.2*x*(1.0 + Math.sin(x*0.05))
val fEst = List[(Double => Double, String)](
  ((x: Double) => 0.2*x, "y=x/5"),
  ((x: Double) => 0.0003*x*x + 0.18*x, "y=3e-4.x^2+0.18x"),
  ((x: Double) => 0.2*x*(1 + Math.sin(x*0.05)), "y=x(1+sin(x/20))/5")
)
val emulator = new BiasVarianceEmulator[Double](emul, 200)
emulator.fit(fEst.map(_._1)) match {
  case Some(varBias) => show(varBias)
  case None => ...
}

# Overfitting
Overfitting affects all aspects of the modeling process negatively, for example:

* It is a sure sign of an overly complex model, which is difficult to debug and consumes computation resources
* It makes the model represent minor fluctuations and noise
* It may discover irrelevant relationships between observed and latent features
* It has poor predictive performance

There are well-proven solutions to reduce overfitting:
* Increasing the size of the training set whenever possible
* Reducing noise in labeled and input data through filtering
* Decreasing the number of features using techniques such as principal components analysis
* Modeling observable and latent noise using filtering techniques such as Kalman or autoregressive models
* Reducing inductive bias in a training set by applying cross-validation
* Penalizing extreme values for some of the model's features using regularization techniques
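
As an illustration of the last point, an L2 (ridge) penalty is one common regularization technique; the notes do not say which technique is intended, so this choice is an assumption. The sketch uses a linear model without intercept, and `sse`, `ridgeLoss`, and the sample values are hypothetical.

```scala
object RidgeSketch {
  // Sum of squared errors for a linear model y ~ w . x (no intercept, for brevity).
  def sse(w: Array[Double], xs: Seq[Array[Double]], ys: Seq[Double]): Double =
    xs.zip(ys).map { case (x, y) =>
      val pred = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      (y - pred) * (y - pred)
    }.sum

  // Ridge loss: the fit error plus lambda times the sum of squared weights,
  // so large weights are penalized even when they fit the data equally well.
  def ridgeLoss(w: Array[Double], xs: Seq[Array[Double]], ys: Seq[Double], lambda: Double): Double =
    sse(w, xs, ys) + lambda * w.map(wi => wi * wi).sum

  // Both weight vectors predict y = 3.0 exactly for x = (1.0, 1.0).
  val xs = Seq(Array(1.0, 1.0))
  val ys = Seq(3.0)
  val small = ridgeLoss(Array(1.0, 2.0), xs, ys, lambda = 0.5)
  val large = ridgeLoss(Array(10.0, -7.0), xs, ys, lambda = 0.5)
}
```

Both weight vectors have zero fit error on this single observation, but the penalty makes the loss of the extreme weights much larger, which steers training towards the smaller ones.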