**CS596 - Machine Learning**
<br>
Date: **30 September 2020**


Title: **Lecture 6**
<br>
Speaker: **Dr. Shota Tsiskaridze**
<br>
Teaching Assistant: **Levan Sanadiradze**

Bibliography:
<br>
**Chapter 14** of Bishop, Christopher M., *Pattern Recognition and Machine Learning*, Springer, 2006 [1]

<h1 align="center">Combining Models</h1>

<h3 align="center">Ensemble Methods</h3>

- **Improved performance** can often be **obtained** by **combining multiple models** in some way, instead of using just a single model.


- There are **various** methods for **Combining Models**, called **Ensemble Methods (EM)**:
  - **Combine** the **output** of multiple models in some way to **improve accuracy**;
  - **Combine** the **series** of $K$ learning models $M_1, ..., M_K$ in order to create an **improved model** $M$.
  

- **Popular** ensemble methods:
  - **Bagging**: averaging the predictions over a collection of classifiers;
  - **Boosting**: weighted vote with a collection of classifiers;
  - **Ensemble**: combining a set of heterogeneous classifiers.


  <br>
  <img src="images/L6_Ensemble_Methods.png" width="500" alt="Example" />


<h3 align="center">Bagging</h3>

- **Committees**: a **combination of $M$ different models** whose **prediction** is the **average of the predictions made by each model**.


- In practice, we have only a **single data set**, and so we have to find a way to **introduce variability** between the **different models** within the **committee**.


- **One approach** is to use **bootstrap** data sets:

  Suppose our **original data set** consists of $N$ **data points** $X = \{x_1, \cdots , x_N\}$. 
  
  We can **create a new data set** $X_B$ by **drawing** $N$ points at **random** from $X$, **with replacement**, so that some points in $X$ may be **replicated** in $X_B$.
  
  This **process** can be **repeated $M$ times** to **generate $M$ data sets**, each of **size** $N$ and each **obtained** by **sampling from the original data set** $X$. 
  
  The **statistical accuracy** of parameter estimates can then be **evaluated** by looking at the **variability of predictions** between the **created data sets**.
  
  

- Let's consider a **regression problem** in which we are **trying to predict** the **value** of a **single continuous variable**.

  Suppose we generate $M$ **bootstrap data sets** and then use each to **train a separate copy** $y_m(\mathbf{x})$ of a predictive model where $m = 1, ..., M$.

  The **prediction** of the **committee** is given by:
  
  $$y_{\mathbf{com}}(\mathbf{x}) = \frac{1}{M} \sum_{m=1}^{M} y_m(\mathbf{x}).$$
  
  This procedure is known as **bagging**.
  
  
  <br>
  <img src="images/L6_Bagging.png" width="800" alt="Example" />
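The bagging procedure above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the base model (a least-squares straight line), the synthetic data, and the helper names `bootstrap_sample`, `fit_line`, and `bagging` are all hypothetical and do not come from the lecture.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def fit_line(points):
    """Least-squares fit of y = a*x + b (assumed base model); returns a predictor."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx if sxx > 0 else 0.0  # degenerate sample: all x identical
    b = mean_y - a * mean_x
    return lambda x: a * x + b

def bagging(data, M, rng):
    """Train M models on M bootstrap data sets and build the committee y_com."""
    models = [fit_line(bootstrap_sample(data, rng)) for _ in range(M)]

    def y_com(x):
        # committee prediction: plain average of the M individual predictions
        return sum(y_m(x) for y_m in models) / M

    return y_com, models

rng = random.Random(0)
# synthetic data: t = 2x + 1 plus Gaussian noise (an assumption for the demo)
data = [(float(x), 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)) for x in range(20)]
y_com, models = bagging(data, M=25, rng=rng)
print(y_com(10.0))  # should lie close to the noise-free value 21
```

Any regression model could replace `fit_line`; the only requirement is that each committee member is trained on its own bootstrap data set.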


<h3 align="center">Average Error Reduction</h3>

- Suppose the **true regression function** that we are **trying to predict** is given by $h(\mathbf{x})$, so that the **output** of each of the models **can be written** as the **true value** plus an **error** in the form:

  $$ y_m(\mathbf{x})  = h(\mathbf{x}) + \epsilon_m(\mathbf{x}).$$
  
  The **average sum-of-squares error** of model $m$ then takes the form:
  
  $$\mathbb{E}_{\mathbf{x}} \left [ \{y_m(\mathbf{x})  - h(\mathbf{x})\}^2 \right ] = \mathbb{E}_{\mathbf{x}} \left [ \epsilon_m(\mathbf{x})^2 \right ],$$
  
  where $\mathbb{E}_{\mathbf{x}}[\cdot]$ denotes a **frequentist expectation** with respect to the distribution of the input vector $\mathbf{x}$.
  
  The **average error** made by the models acting individually is therefore:
  
  $$E_{AV} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathbf{x}} \left [ \epsilon_m(\mathbf{x})^2 \right ].$$
  
  Similarly, the **expected error** from the **committee** is given by:
  
  $$E_{com} = \mathbb{E}_{\mathbf{x}} \left [ \left \{ \frac{1}{M}  \sum_{m=1}^{M} y_m(\mathbf{x}) - h(\mathbf{x})\right \}^2 \right ] = \mathbb{E}_{\mathbf{x}} \left [ \left \{ \frac{1}{M}  \sum_{m=1}^{M} \epsilon_m(\mathbf{x}) \right \}^2 \right ].$$
  
  If we **assume** that the **errors have zero mean** and **are uncorrelated**, so that:
  
  $$\mathbb{E}_{\mathbf{x}} \left [ \epsilon_m(\mathbf{x}) \right] =  0,$$
  
  $$\mathbb{E}_{\mathbf{x}} \left [\epsilon_m(\mathbf{x})\epsilon_l(\mathbf{x}) \right] = 0, \textrm{ for all } m\neq l,$$
  
  then all cross terms in the expanded square vanish and we obtain:
  
  $$E_{com} = \frac{1}{M^2} \sum_{m=1}^{M} \mathbb{E}_{\mathbf{x}} \left [ \epsilon_m(\mathbf{x})^2 \right ] = \frac{1}{M} E_{AV}.$$
 
 
- The last equation gives an apparently **dramatic result**: it suggests that the **average error** of a model **can be reduced** by a **factor of $M$** simply by **averaging $M$ versions of the model**.

- Unfortunately, this result depends on the **key assumption** that the errors due to the **individual models are uncorrelated**.

- **In practice**, the **errors are typically highly correlated**, and the **reduction** in overall error is **generally small**.


- However, it can be shown that the **expected committee error** will **not exceed** the **expected error** of the **constituent models**:

  $$E_{com} \leq E_{AV}.$$
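The effect of correlation on the committee error can be checked numerically. The sketch below is a Monte Carlo illustration under an assumed error model that is not part of the lecture: every model error has zero mean, unit variance, and the same pairwise correlation `rho`, obtained by mixing a shared Gaussian component with an individual one. Under that assumption the ratio $E_{com} / E_{AV}$ should be close to $\rho + (1 - \rho)/M$.

```python
import math
import random

def committee_errors(M, rho, n_samples, rng):
    """Monte Carlo estimates of (E_AV, E_com) for M models.

    Assumed error model: eps_m = sqrt(rho) * z + sqrt(1 - rho) * u_m,
    with z and u_m independent N(0, 1) draws, so corr(eps_m, eps_l) = rho.
    """
    sum_av = 0.0
    sum_com = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)  # component shared by all models
        eps = [math.sqrt(rho) * z + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
               for _ in range(M)]
        sum_av += sum(e * e for e in eps) / M  # average individual squared error
        sum_com += (sum(eps) / M) ** 2         # squared error of the averaged model
    return sum_av / n_samples, sum_com / n_samples

rng = random.Random(0)
M = 10
e_av_u, e_com_u = committee_errors(M, rho=0.0, n_samples=20000, rng=rng)
e_av_c, e_com_c = committee_errors(M, rho=0.9, n_samples=20000, rng=rng)
print("uncorrelated: E_com / E_AV =", e_com_u / e_av_u)  # close to 1/M = 0.1
print("correlated:   E_com / E_AV =", e_com_c / e_av_c)  # close to 0.91
```

With uncorrelated errors the reduction is the full factor of $M$; with `rho = 0.9` the committee is only marginally better than a single model, while $E_{com} \leq E_{AV}$ still holds in both cases.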

<h3 align="center">Boosting</h3>

- In order to **achieve more significant improvements**, we turn to a more **sophisticated technique** for building committees, known as **boosting**.


- **Boosting** is a powerful technique for **combining multiple base classifiers** to produce a form of **committee** whose performance can be **significantly better** than that of **any** of the **base classifiers**.


- **Boosting** can give **good results** even if the **base classifiers have a performance** that is only **slightly better than random**, and hence sometimes the **base classifiers** are known as **weak learners**.


- **Boosting** involves **training multiple models** in **sequence** in which the **error function** used to train a **particular model** depends on the **performance of the previous models**.


- The most widely used form of **boosting** algorithm is **AdaBoost**, short for **Adaptive Boosting**.


- Let's consider a **two-class classification problem**:
  - The **training data** comprises **input vectors** $\mathbf{x}_1, \ldots, \mathbf{x}_N$ along with corresponding binary **target variables** $t_1, \ldots, t_N$ where $t_n \in \{-1, 1\}$;
  - **Each data point** is given an associated **weighting parameter** $w_n$, which is **initially set** to $\frac{1}{N}$ for all data points;
  - We suppose that **we have a procedure** available for **training a base classifier** using weighted data to give a function $y(\mathbf{x}) \in \{-1, 1\}$.


- The **precise form** of the **AdaBoost algorithm** is given below.


1. **Initialize** the data **weighting coefficients** $\{w_n\}$ by setting $w_n^{(1)} = \frac{1}{N}$ for $n = 1, ..., N$.
  
  
2. For $m = 1, ..., M$ do:
   
   (a) **Fit a classifier** $y_m(\mathbf{x})$ to the training data by **minimizing** the **weighted error** function:

      $$J_m = \sum_{n=1}^{N} w_n^{(m)} I(y_m(\mathbf{x}_n) \neq t_n),$$

      &emsp; where $I(y_m(\mathbf{x}_n) \neq t_n)$ is the **indicator function**, which equals $1$ when $y_m(\mathbf{x}_n) \neq t_n$ and $0$ otherwise.

   (b) **Evaluate the quantities**:

      $$\epsilon_m = \frac{\sum_{n=1}^{N} w_n^{(m)} I(y_m(\mathbf{x}_n) \neq t_n)}{\sum_{n=1}^{N} w_n^{(m)}}$$

      &emsp; and then use these to evaluate:

      $$\alpha_m = \ln \left \{ \frac{1 - \epsilon_m}{\epsilon_m} \right \}.$$

   (c) **Update the data weighting coefficients**:

      $$w_n^{(m+1)} = w_n^{(m)} \exp \{ \alpha_m I(y_m(\mathbf{x}_n) \neq t_n) \}.$$
  
  
3. **Make predictions** using the final model, which is given by:
  
    $$ Y_M (\mathbf{x}) = \mathrm{sign} \left( \sum_{m=1}^{M} \alpha_m y_m(\mathbf{x}) \right).$$
    

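- We see that in subsequent iterations the weighting coefficients $w_n^{(m)}$ are **increased** for data points that are **misclassified** (by the factor $\exp(\alpha_m)$, which exceeds $1$ whenever $\epsilon_m < \frac{1}{2}$), while those of **correctly classified** points are left unchanged and therefore **decrease relative** to the total weight.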
- **Successive classifiers** are therefore **forced to place greater emphasis on points that have been misclassified by previous classifiers**.

  <img src="images/L6_Boosting.png" width="500" alt="Example" />
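The three AdaBoost steps can be sketched in plain Python. The choice of base classifier is an assumption made here for illustration: a decision stump (a single threshold on a single input variable). The helper names `fit_stump`, `stump_predict`, `adaboost`, and `predict`, the clipping of $\epsilon_m$, and the toy data set are likewise hypothetical.

```python
import math

def stump_predict(stump, x):
    """Stump (d, s, sign): predict sign if x[d] > s, otherwise -sign."""
    d, s, sign = stump
    return sign if x[d] > s else -sign

def fit_stump(X, t, w):
    """Step (a): choose the stump that minimizes the weighted error J_m."""
    best = None
    for d in range(len(X[0])):
        values = sorted(set(x[d] for x in X))
        # candidate thresholds: one below the minimum, plus all midpoints
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for s in thresholds:
            for sign in (1, -1):
                J = sum(wn for x, tn, wn in zip(X, t, w)
                        if stump_predict((d, s, sign), x) != tn)
                if best is None or J < best[0]:
                    best = (J, d, s, sign)
    return best[1:]

def adaboost(X, t, M):
    """Return a list of (alpha_m, stump_m) pairs for m = 1, ..., M."""
    N = len(X)
    w = [1.0 / N] * N                                     # step 1
    ensemble = []
    for _ in range(M):                                    # step 2
        stump = fit_stump(X, t, w)                        # (a)
        miss = [stump_predict(stump, x) != tn for x, tn in zip(X, t)]
        eps = sum(wn for wn, m in zip(w, miss) if m) / sum(w)   # (b)
        eps = min(max(eps, 1e-10), 1 - 1e-10)  # guard against ln(0); not in the notes
        alpha = math.log((1 - eps) / eps)
        w = [wn * math.exp(alpha) if m else wn for wn, m in zip(w, miss)]  # (c)
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, x):
    """Step 3: sign of the alpha-weighted vote (ties resolved as +1)."""
    total = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if total >= 0 else -1

# toy 1-D data that no single stump can classify correctly
X = [[float(i)] for i in range(6)]
t = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, t, M=3)
print([predict(ensemble, x) for x in X])
```

On this toy set every single stump misclassifies at least two points, whereas the weighted vote of three stumps reproduces all six targets, which is the behaviour the weak-learner discussion above describes.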



<h3 align="center">A General Tree</h3>

- **Trees Definition:**
  - **Trees** consist of **nodes** which are connected by **edges**;
    - **Nodes** represent items of the collection;
    - **Edges** connect nodes and represent the relationship between nodes.
  - A **tree** is a **collection of nodes** that originate from a **unique starting node** called the **root**;
  - A **tree** can be defined **recursively**:
    - A **single node** by itself **is a tree**;
    - Given a node $n$ and trees $t_1, t_2, \ldots, t_k$ with roots $n_1, n_2, \ldots, n_k$, a **new tree** may be **constructed** by making $n$ the parent of $n_1, n_2, \ldots, n_k$.
    
    
- **Tree Terminology**:
  - **Path** - a sequence of edges between nodes;
  - **Root Node** -  the special node from which all other nodes *descend*;
  - **Parent of Node $n$** - the unique node with an edge to node $n$ and which is the first node on the path from $n$ to the root;
  - **Child of Node $n$** - a node for which node $n$ is the next node on the path to the root node;
  - **Siblings** - the nodes with the same parent;
  - **Terminal Node (Leaf)** - a node with no children;
  - **Internal Node** - a non-leaf node, i.e. a node with at least one child.

<img src="images/L6_General_Tree_Structure.png" width="600" alt="Example" />
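The terminology above maps directly onto a small data structure. The class below is a minimal sketch; the class name `Node` and its method names are hypothetical and chosen only to mirror the terms in the list.

```python
class Node:
    """A tree node: holds an item, a link to its parent, and a list of children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent      # None for the root node
        self.children = []

    def add_child(self, item):
        """Create a child of this node and return it."""
        child = Node(item, parent=self)
        self.children.append(child)
        return child

    def is_terminal(self):
        """A terminal (leaf) node has no children."""
        return not self.children

    def siblings(self):
        """All other nodes that share this node's parent."""
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def path_to_root(self):
        """Items visited when following parent links up to the root."""
        node, path = self, []
        while node is not None:
            path.append(node.item)
            node = node.parent
        return path

root = Node("root")
a = root.add_child("a")
b = root.add_child("b")
c = a.add_child("c")
print(c.path_to_root())                 # ['c', 'a', 'root']
print([s.item for s in a.siblings()])   # ['b']
```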

<h3 align="center">Decision Tree</h3>

- **Decision trees** used in data mining are of **two main types**:
  - **Classification** tree analysis is when the predicted outcome is the class (**discrete**) to which the data belongs;
  - **Regression** tree analysis is when the predicted outcome can be considered a **real number**.
  
  
<img src="images/L6_Decision_Tree.png" width="400" alt="Example" />

<h3 align="center">Binary Decision Tree on $\mathbb{R}^2$</h3>

- Let $x_1$ and $x_2$ be the coordinates of our two-dimensional **input space** and $\theta_1, \ldots, \theta_4$ be the **parameters of the model**:
  - The **first step** divides the whole of the input space into **two regions** according to whether $x_1 \leq \theta_1$ or $x_1 > \theta_1$;
  - The region $x_1 \leq \theta_1$ is further **subdivided** according to whether $x_2 \leq \theta_2$ or $x_2 > \theta_2$, giving rise to the regions denoted **A** and **B**.


- For any **new input $x$**, we determine which region it falls into by starting at the top of the tree at the root node and following a path down to a specific leaf node according to the decision criteria at each node.


<img src="images/L6_Binary_Decision_Tree.png" width="1000" alt="Example" />
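The root-to-leaf traversal can be sketched as follows. The tuple encoding of the tree, the function name `find_region`, and the numeric threshold values are assumptions made for this example. Only the two splits described in the text are encoded; the branch $x_1 > \theta_1$ is kept as a single placeholder leaf, because its further splits appear only in the figure.

```python
def find_region(tree, x):
    """Start at the root and follow the decision criteria down to a leaf.

    Internal node: (index, theta, left, right), where the left subtree is
    taken when x[index] <= theta. Leaf: a string naming the region.
    """
    node = tree
    while not isinstance(node, str):
        index, theta, left, right = node
        node = left if x[index] <= theta else right
    return node

theta_1, theta_2 = 0.5, 0.3           # hypothetical parameter values
tree = (0, theta_1,                   # root: x_1 <= theta_1 ?
        (1, theta_2, "A", "B"),       # then: x_2 <= theta_2 ?
        "x1 > theta_1")               # placeholder for the remaining regions

print(find_region(tree, (0.2, 0.1)))  # A
print(find_region(tree, (0.2, 0.9)))  # B
```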



<h3 align="center">Decision Tree for Regression Problem</h3>


- Let's first consider a **regression problem** in which the **goal** is to predict a single target variable $t$ from a $D$-dimensional vector $\mathbf{x} = (x_1, \ldots, x_D)^\mathrm{T}$ of input variables. 


- The training data consists of input vectors $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ along with the corresponding continuous labels $\{t_1, \ldots, t_N\}$.


- Suppose the **leaf nodes** are indexed $p = 1, ..., P$,  with leaf node $p$ representing a region $R_p$ of input space having $N_p$ data points, and $P$ denoting the **total number of leaf nodes**.


- Therefore, the decision tree gives a partition of the input space $X$, which contains the training inputs $\mathbf{x}_1, \ldots, \mathbf{x}_N$, into regions $R_1, \ldots, R_P$.


- Recall that a partition is a **disjoint union**, therefore:

  $$X = \bigcup_{p=1}^{P} R_p$$
  
  and
  
  $$R_i \cap R_j = \emptyset \textrm{ for all } i \neq j.$$ 


- Given the partition $\{R_1, R_2, \ldots, R_P \}$, the **prediction** for an **input $x$** takes the form:

  $$y(x) = \sum_{p=1}^{P} \tau_p \cdot \mathrm{I}(x \in R_p).$$


- To choose $\tau_1, \tau_2, ..., \tau_P$ we **minimize** the **sum-of-squares error** function:

  $$J = \sum_{n=1}^{N} \left \{  y(x_n) - t_n \right \}^2,$$

  then the **optimal value** of the predictive variable within any given region is just given by the **average of the values** of $t_n$ for those **data points** that **fall in** that **region**:

  $$\tau_p = ave(t_n | x_n \in R_p) = \frac{1}{N_p} \sum_{x_n \in R_p} t_n$$
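As a small numerical illustration of the leaf averages, the sketch below computes $\tau_p$ for a fixed partition and checks that moving a $\tau_p$ away from the average increases $J$. The function names, the one-dimensional toy data, and the 0-based region indices are assumptions made for the example.

```python
def leaf_averages(X, t, region_of, P):
    """tau_p = average of t_n over the training points with x_n in R_p."""
    sums = [0.0] * P
    counts = [0] * P
    for x_n, t_n in zip(X, t):
        p = region_of(x_n)
        sums[p] += t_n
        counts[p] += 1
    # a region without training points has nothing to average: marked as None
    return [s / c if c else None for s, c in zip(sums, counts)]

def predict(x, tau, region_of):
    """y(x) = tau_p for the single region R_p that contains x."""
    return tau[region_of(x)]

def sum_of_squares_error(X, t, tau, region_of):
    """J = sum over n of (y(x_n) - t_n)^2."""
    return sum((predict(x_n, tau, region_of) - t_n) ** 2 for x_n, t_n in zip(X, t))

# toy partition of the real line: R_0 = {x <= 2.5}, R_1 = {x > 2.5}
def region_of(x):
    return 0 if x <= 2.5 else 1

X = [1.0, 2.0, 3.0, 4.0]
t = [1.0, 3.0, 10.0, 12.0]
tau = leaf_averages(X, t, region_of, P=2)
print(tau)                                                   # [2.0, 11.0]
print(sum_of_squares_error(X, t, tau, region_of))            # 4.0
print(sum_of_squares_error(X, t, [2.5, 11.0], region_of))    # 4.5, i.e. worse
```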



<h3 align="center">Complexity of a Decision Tree</h3>

Let's consider how to determine the **structure** of the **decision tree**.

- Even for a **fixed number of nodes** in the tree, the **problem of determining the optimal structure** to minimize the sum-of-squares error is usually **computationally infeasible** due to the combinatorially large number of possible solutions.


- We proceed with a **greedy** algorithm:
  - Start with a single **root node**, corresponding to the whole input space, and grow the tree by **adding nodes one at a time**;
  - At **each step** there will be some **number of candidate regions** in input space that can be split;
  - Among the candidates, the **chosen split** (region, input variable, and threshold) is the one that gives the **smallest residual sum-of-squares error**;
  - **Stop** when the reduction in **residual error falls below some threshold**.


Let's formalize this **decision step**:

- Let $v$ be the **splitting variable**: $v \in \{1, \ldots, D\}$;
- Let $s$ be the **split point**: $s \in \mathbb{R}$;
- Then the partition **based** on $v$ and $s$ will be:
  
  $$R_1(v, s) = \{x | x_v \leq s\}$$
  $$R_2(v, s) = \{x | x_v > s\}$$


- For each **splitting variable** $v$ and **split point** $s$,
  
  $$\tau_1 (v,s)  = ave (t_n | x_n \in R_1(v,s))$$
  $$\tau_2 (v,s)  = ave (t_n | x_n \in R_2(v,s))$$


- We **determine** $v$ and $s$ by **minimizing** the **sum-of-squares error** function:
  
  $$J_{(v,s)} = \sum_{x_n \in R_1(v,s)} (t_n - \tau_1 (v,s))^2 + \sum_{x_n \in R_2(v,s)} (t_n - \tau_2 (v,s))^2$$ 
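The decision step can be sketched as an exhaustive search over $(v, s)$. Two assumptions are made here that the notes do not fix: candidate split points are the midpoints between consecutive distinct values of each input variable, and variables are indexed from 0. The function names and the toy data are hypothetical.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(X, t):
    """Return (J, v, s) minimizing J_(v,s), or None if no split is possible."""
    best = None
    for v in range(len(X[0])):
        values = sorted(set(x[v] for x in X))
        for a, b in zip(values, values[1:]):
            s = (a + b) / 2  # assumed candidate: midpoint between neighbours
            left = [t_n for x, t_n in zip(X, t) if x[v] <= s]    # R_1(v, s)
            right = [t_n for x, t_n in zip(X, t) if x[v] > s]    # R_2(v, s)
            # tau_1, tau_2 are the region averages, so J is the sum of the
            # two within-region sums of squares
            J = sum_of_squares(left) + sum_of_squares(right)
            if best is None or J < best[0]:
                best = (J, v, s)
    return best

X = [(1.0, 5.0), (2.0, 1.0), (3.0, 4.0), (4.0, 2.0)]
t = [1.0, 1.0, 5.0, 5.0]
print(best_split(X, t))  # (0.0, 0, 2.5): split variable 0 at s = 2.5
```

Growing a full tree would apply `best_split` recursively to the points falling in each new region until the stopping rule is met.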

<h3 align="center">Complexity Control Strategy</h3>

- When do we stop?
  - If the tree is **too big**, we may **overfit**.
  - If **too small**, we may miss patterns in the data, i.e. we may **underfit**.
  
  
- A typical approach is to use **pruning**:
  - Build a **really big tree** using a **stopping criterion** based on the **number of data points** associated with the **leaf nodes**, for example until all regions have $\leq 5$ points;
  - **Prune back the resulting tree**.
  
  
- The **pruning criterion** is then given by:

  $$Cr(P) = \sum_{p=1}^{P} J_p(P) + \lambda P,$$
  
  where $J_p(P) = \sum_{x_n \in R_p} (t_n - \tau_p)^2$ is the **residual sum-of-squares error** in leaf $p$, and $\lambda$ is the **regularization parameter** that determines the **trade-off** between the **overall residual sum-of-squares error** and the **complexity of the model** as measured by the number $P$ of **leaf nodes**.
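A short numerical sketch shows how $\lambda$ shifts the preference between a larger and a smaller tree. The per-leaf error values below are made up for illustration, and the function name `pruning_criterion` is hypothetical.

```python
def pruning_criterion(leaf_errors, lam):
    """Cr(P) = sum of the per-leaf errors J_p(P) + lam * P, with P = number of leaves."""
    return sum(leaf_errors) + lam * len(leaf_errors)

big_tree = [0.5, 0.2, 0.3, 0.1, 0.4]   # hypothetical J_p values, P = 5
small_tree = [1.2, 0.9]                # hypothetical J_p values, P = 2

for lam in (0.0, 0.1, 0.5):
    cr_big = pruning_criterion(big_tree, lam)
    cr_small = pruning_criterion(small_tree, lam)
    preferred = "big" if cr_big < cr_small else "small"
    print(lam, round(cr_big, 3), round(cr_small, 3), preferred)
```

For these numbers the larger tree is preferred while $\lambda < 0.2$ and the smaller one beyond that; in practice $\lambda$ is usually chosen by cross-validation.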

<h3 align="center">Decision Tree for Classification Problem</h3>

- For **classification problems**, the process of growing and pruning the tree is similar, except that the **sum-of-squares error** is **replaced by** a more appropriate **measure of performance**.


- If we define $\rho_{pk}$ to be the proportion of data points in region $R_p$ assigned to class $k$, where $k = 1, \ldots, K$, then two commonly used choices are:
  - the **cross-entropy**:
    $$ J_p(P) = -\sum_{k=1}^{K} \rho_{pk} \ln \rho_{pk},$$ 
  - the **Gini index**:
    $$ J_p(P) = \sum_{k=1}^{K} \rho_{pk} (1 - \rho_{pk}).$$ 
    
- The **cross-entropy** and the **Gini index** are **better measures than the misclassification rate** for growing the tree because they are more sensitive to the node probabilities. Also, unlike the misclassification rate, they are **differentiable** and hence better suited to **gradient-based optimization methods**.
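The sensitivity to node probabilities can be seen in a standard two-class example (it appears as an exercise in Chapter 14 of [1]): 400 points per class, split either into leaves with class counts (300, 100) and (100, 300), or into (200, 400) and (200, 0). The sketch below computes the three measures, weighting each leaf by its share of the data; the cross-entropy is taken with the conventional leading minus sign. The function names are hypothetical.

```python
import math

def cross_entropy(p):
    """-sum_k p_k ln p_k, with 0 * ln 0 taken as 0."""
    return -sum(q * math.log(q) for q in p if q > 0)

def gini(p):
    """sum_k p_k (1 - p_k)."""
    return sum(q * (1 - q) for q in p)

def misclassification(p):
    """Error rate when the leaf predicts its majority class."""
    return 1 - max(p)

def weighted(measure, leaves):
    """Average a leaf measure over leaves given as lists of per-class counts."""
    total = sum(sum(counts) for counts in leaves)
    return sum(sum(counts) / total * measure([c / sum(counts) for c in counts])
               for counts in leaves)

split_a = [[300, 100], [100, 300]]
split_b = [[200, 400], [200, 0]]
for name, measure in [("misclassification", misclassification),
                      ("gini", gini),
                      ("cross-entropy", cross_entropy)]:
    print(name, round(weighted(measure, split_a), 4), round(weighted(measure, split_b), 4))
```

Both splits have the same misclassification rate of 0.25, but the Gini index and the cross-entropy are lower for the second split, which produces a pure leaf.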

<h3 align="center">Tree Based Methods Problems</h3>

- The **splits are aligned with the axes** of the feature space, so **Decision Trees** have to **work much harder to capture linear relations** that are not axis-aligned.

- Decision trees are **very sensitive to the data set**: a small change to the training data can result in a very different set of splits $\rightarrow$ <br>**A common remedy is the random forest**.


<img src="images/L6_Decision_Trees_vs_Linear_Models.png" width="800" alt="Example" />


<h3 align="center">Random Forest</h3>

- **Random forest** is an ensemble classifier that uses **many decision tree models**:
  - **Each decision tree classifier** is generated using a **random selection of attributes** at each node to **determine the split**.
  - **During classification** each **tree votes** and the **most popular class is returned**. 


- It can be used for both **classification** and **regression problems**; for regression, the trees' predictions are **averaged** instead of voted on.


- **Advantages**:
  - The ability to **efficiently process** data with a **large number** of **features** and **classes**;
  - **Insensitivity** to the **scaling** of feature values;
  - **Both continuous** and **discrete attributes** are **equally well processed**;
  - There are methods for **constructing trees** from data with **missing attribute values**;
  - There are methods for **assessing the significance** of **individual features** in a model.
  
  
- **Disadvantages**: 
  - **Random forests** have been observed to **overfit** for **some datasets** with noisy classification/regression tasks;
  - **Large number of trees** may make the **algorithm slow** for **real-time predictions**.


  <img src="images/L6_Random_Forest.png" width="600" alt="Example" />
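The two ingredients of a random forest, bootstrap samples and a random selection of attributes, can be sketched with very small trees. To stay short, each tree here is a single split (a stump) scored by its number of misclassified points, so the attribute subset is drawn once per tree; a real random forest grows deeper trees and draws a new subset at every node. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, t, features):
    """Best single split using only the given feature indices."""
    best = None
    for d in features:
        values = sorted(set(x[d] for x in X))
        for a, b in zip(values, values[1:]):
            s = (a + b) / 2
            left = Counter(tn for x, tn in zip(X, t) if x[d] <= s)
            right = Counter(tn for x, tn in zip(X, t) if x[d] > s)
            l_label = left.most_common(1)[0][0]    # majority class on each side
            r_label = right.most_common(1)[0][0]
            errors = (sum(left.values()) - left[l_label]
                      + sum(right.values()) - right[r_label])
            if best is None or errors < best[0]:
                best = (errors, d, s, l_label, r_label)
    if best is None:  # no split possible: always predict the majority class
        label = Counter(t).most_common(1)[0][0]
        return (features[0], float("inf"), label, label)
    return best[1:]

def random_forest(X, t, n_trees, n_features, rng):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature subset."""
    N, D = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(N) for _ in range(N)]      # bootstrap sample
        features = rng.sample(range(D), n_features)     # random attribute subset
        forest.append(fit_stump([X[i] for i in idx], [t[i] for i in idx], features))
    return forest

def predict(forest, x):
    """Each tree votes; the most popular class is returned."""
    votes = Counter(l if x[d] <= s else r for d, s, l, r in forest)
    return votes.most_common(1)[0][0]

# toy data: two well-separated classes, both attributes informative
X = [(float(i), 50.0 + i) for i in range(5)] + [(100.0 + i, 200.0 + i) for i in range(5)]
t = [0] * 5 + [1] * 5
forest = random_forest(X, t, n_trees=15, n_features=1, rng=random.Random(0))
print([predict(forest, x) for x in X])
```

An odd number of trees avoids tied votes in the two-class case.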


<h1 align="center">End of Lecture</h1>