__Chapter 5 - Compressing Data via Dimensionality Reduction__

1. [Learning with ensembles](#Learning-with-ensembles)
    1. [Implementing a simple majority vote classifier](#Implementing-a-simple-majority-vote-classifier)
    1. [Homegrown implementation](#Homegrown-implementation-ens)


In [2]:
# Standard library and settings
import os
import sys
import importlib
import itertools
import warnings; warnings.simplefilter('ignore')
dataPath = os.path.abspath(os.path.join('../../Data'))
modulePath = os.path.abspath(os.path.join('../../CustomModules'))
sys.path.append(modulePath) if modulePath not in sys.path else None
from IPython.display import display, HTML; display(HTML("<style>.container { width:95% !important; }</style>"))


# Data extensions and settings
import numpy as np
np.set_printoptions(threshold = np.inf, suppress = True)
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.options.display.float_format = '{:,.6f}'.format


# Modeling extensions
import sklearn.base as base
import sklearn.cluster as cluster
import sklearn.datasets as datasets
import sklearn.decomposition as decomposition
import sklearn.discriminant_analysis as discriminant_analysis
import sklearn.ensemble as ensemble
import sklearn.feature_extraction as feature_extraction
import sklearn.feature_selection as feature_selection
import sklearn.linear_model as linear_model
import sklearn.metrics as metrics
import sklearn.model_selection as model_selection
import sklearn.neighbors as neighbors
import sklearn.pipeline as pipeline
import sklearn.preprocessing as preprocessing
import sklearn.svm as svm
import sklearn.tree as tree
import sklearn.utils as utils


# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt


# Custom extensions and settings
from quickplot import qp, qpUtil, qpStyle
from mlTools import powerGridSearch
sns.set(rc = qpStyle.rcGrey)


# Magic functions
%matplotlib inline


<a id = 'Learning-with-ensembles'></a>

# Learning with ensembles

An ensemble is a meta-classifier that combines several different models trained on the same problem. Rather than relying on the predictions of a single model, an ensemble has every model predict on the test data and makes its final prediction by majority vote. The aim is to generalize better than any individual model would on its own.

For binary class labels, a majority vote among the ensemble's models determines the prediction. For multi-class labels, plurality (the most votes, or mode) determines the prediction.

$$
\hat{y} = mode\{C_1(x), C_2(x),...,C_m(x)\}
$$

where $C_i$ is one of $m$ distinct models, each of which makes a prediction for observation $x$.
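
As a minimal sketch of the mode formula above, the plurality vote can be computed per observation with `np.bincount`. The prediction matrix and the helper name `plurality_vote` are hypothetical and exist only for illustration; labels are assumed to be non-negative integers.

```python
import numpy as np

# Hypothetical predictions: rows are models C_1..C_3, columns are observations
predictions = np.array([[0, 1, 2, 1],
                        [0, 2, 2, 1],
                        [1, 1, 0, 1]])


def plurality_vote(predictions):
    """Return the most frequent label per column (ties go to the smallest label)."""
    # np.bincount counts how often each integer label occurs in a column,
    # and np.argmax picks the label with the highest count.
    return np.array([np.argmax(np.bincount(col)) for col in predictions.T])


y_hat = plurality_vote(predictions)
print(y_hat)  # [0 1 2 1]
```

Note that `np.argmax` breaks ties in favor of the smallest label, which is a design choice of this sketch rather than part of the formula.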

<a id = 'Implementing-a-simple-majority-vote-classifier'></a>

## Implementing a simple majority vote classifier

It's possible to implement a majority vote classifier where each model has an associated weight pertaining to our confidence in that model.

$$
\hat{y} = \underset{i}{\mbox{arg max}}\sum_{j=1}^{m}w_j\chi_A(C_j(x)=i)
$$

$w_j$ is the weight associated with model $C_j$, $\hat{y}$ is the predicted class label of the whole ensemble, $\chi_A$ is the characteristic function $[C_j(x)=i\in A]$, and $A$ is the set containing the unique class labels. If all model weights are equal, then the equation is effectively the same as the 'mode' function above.

As an example, we have 3 base classifiers and we're predicting the class label of a single instance $\textbf{x}$. $C_1$ and $C_2$ predict class 0, and $C_3$ predicts class 1. Under equal weighting:

$$
\hat{y} = mode\{0,0,1\} = 0
$$

Now let's assign model $C_3$ a weight of 0.6 and models $C_1$ and $C_2$ weights of 0.2 each:

$$
\hat{y} = \mbox{arg max} [0.2 \times i_0 + 0.2 \times i_0 + 0.6 \times i_1] = 1
$$

In [3]:
# Weighted majority vote: class 0 totals 0.2 + 0.2 = 0.4, class 1 totals 0.6

np.argmax(np.bincount([0,0,1], weights = [0.2, 0.2, 0.6]))


1

Additionally, ensembles can use the class probabilities returned by each model's predict_proba method instead of the predicted class labels.

$$
\hat{y} = \underset{i}{\mbox{arg max}}\sum_{j=1}^{m}w_jp_{ij}
$$

where $p_{ij}$ is the predicted probability of the $j$th classifier for class label $i$.

Let's again use our example, a binary classification problem with class labels $i \in \{0,1\}$ and an ensemble of three classifiers $C_j$ ($j \in \{1,2,3\}$). Suppose one sample $\textbf{x}$ returns the following probabilities from each of the 3 classifiers:

$$
C_1(x) \rightarrow [0.9,0.1], C_2(x) \rightarrow [0.8,0.2], C_3(x) \rightarrow [0.4,0.6]
$$

With weights of 0.2, 0.2 and 0.6 for $C_1$, $C_2$ and $C_3$, the individual class probabilities are then calculated as:
$$
p(i_0|\textbf{x}) = 0.2 \times 0.9 + 0.2 \times 0.8 + 0.6 \times 0.4 = 0.58
$$
$$
p(i_1|\textbf{x}) = 0.2 \times 0.1 + 0.2 \times 0.2 + 0.6 \times 0.6 = 0.42
$$
$$
\hat{y} = \mbox{arg max} [p(i_0|\textbf{x}),p(i_1|\textbf{x})] = 0
$$

In [9]:
# Weighted average of predicted class probabilities (rows are classifiers, columns are classes)

ex = np.array([[0.9,0.1]
             ,[0.8,0.2]
             ,[0.4,0.6]])
p = np.average(ex, axis = 0, weights = [0.2, 0.2, 0.6])
print(p)
print(np.argmax(p))


[0.58 0.42]
0
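
The same weighted average extends to several samples at once. The sketch below assumes a hypothetical probability array of shape (n_models, n_samples, n_classes); the first sample reuses the values from the example above and the second sample is made up for illustration.

```python
import numpy as np

# Hypothetical probabilities with shape (n_models, n_samples, n_classes)
probas = np.array([[[0.9, 0.1], [0.3, 0.7]],   # C_1
                   [[0.8, 0.2], [0.6, 0.4]],   # C_2
                   [[0.4, 0.6], [0.1, 0.9]]])  # C_3
weights = [0.2, 0.2, 0.6]

# Average over the model axis, leaving one row of class probabilities per sample
avg_proba = np.average(probas, axis=0, weights=weights)

# Pick the class with the highest weighted probability for each sample
y_hat = np.argmax(avg_proba, axis=1)

print(avg_proba)
print(y_hat)  # [0 1]
```

Averaging over `axis=0` collapses the model dimension, so the result has shape (n_samples, n_classes).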


<a id = 'Homegrown-implementation-ens'></a>

## Homegrown implementation

In [12]:
# Homegrown majority vote classifier

import operator
from sklearn.base import clone
from sklearn.pipeline import _name_estimators

class MajorityVoteClassifier(base.BaseEstimator, base.ClassifierMixin):
    
    
    def __init__(self, classifiers, vote = 'classlabel', weights = None):
        """
        Info:
            Description:
                Classifier for performing ensemble majority vote.
            Parameters:
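                classifiers : array
                    Classifiers of the ensemble
                vote : str, 'classlabel' or 'probability'
                    If 'classlabel', predict by (weighted) majority vote on class labels.
                    If 'probability', predict by argmax of the (weighted) average class probabilities.
                weights : array, default = None
                    One weight per classifier. Uniform weights are used if None.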
        """
        self.classifiers = classifiers
        self.named_classifiers = {key : value for key, value in _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights
        
    def fit(self, X, y):
        """
        Info:
            Description:
                Fit classifiers.
            Parameters:
                X : array
                    Training data
                y : array
                    Labels for training data
        """
        self.lablenc_ = preprocessing.LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X, self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self
        
    def predict(self, X):
        """
        Info:
            Description:
                Generate predictions for X.
            Parameters:
                X : array
                    Data to generate predictions for
        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X), axis = 1)
        else:
            # Collect class label predictions; rows are samples, columns are classifiers
            predictions = np.asarray([clf.predict(X) for clf in self.classifiers_]).T
            
            maj_vote = np.apply_along_axis(lambda x: \
                                           np.argmax(np.bincount(x, weights = self.weights))
                                          ,axis = 1
                                          ,arr = predictions)
        # Map encoded labels back to the original class labels for both vote types
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote
        
            
    def predict_proba(self, X):
        """
        Info:
            Description:
                Predict class probabilities for X.
            Parameters:
                X : array
                    Data to predict class probabilities for
        """
        probas = np.asarray([clf.predict_proba(X) for clf in self.classifiers_])
        avg_proba = np.average(probas, axis = 0, weights = self.weights)
        return avg_proba
        
    def get_params(self, deep = True):
        """
        Info:
            Description:
                Get classifier parameter names for GridSearch.
            Parameters:
                deep : boolean, default = True
        """
        if not deep:
            return super(MajorityVoteClassifier, self).get_params(deep = False)
        else:
            out = self.named_classifiers.copy()
            for name, step in self.named_classifiers.items():
                for key, value in step.get_params(deep = True).items():
                    out['%s__%s' % (name, key)] = value
            return out
        

In [14]:
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [25]:
# Import iris dataset

iris = datasets.load_iris()
df = pd.DataFrame(np.c_[iris['data'], iris['target']]
                 ,columns = iris['feature_names'] + ['target'])

# Trim iris data set down to two classes and two features

df = df.iloc[50:, [1, 2, 4]]
#df['target'] = np.where(df['target'] == 0.0, -1, 1)
df[:5]


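|    | sepal width (cm) | petal length (cm) | target |
|----|------------------|-------------------|--------|
| 50 | 3.2              | 4.7               | 1.0    |
| 51 | 3.2              | 4.5               | 1.0    |
| 52 | 3.1              | 4.9               | 1.0    |
| 53 | 2.3              | 4.0               | 1.0    |
| 54 | 2.8              | 4.6               | 1.0    |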


In [26]:
df['target'].drop_duplicates()

50    1.000000
100   2.000000
Name: target, dtype: float64
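
For comparison, here is a hedged sketch of how an ensemble could be evaluated on the same two-class, two-feature subset. It uses scikit-learn's built-in `VotingClassifier` rather than the homegrown class, and the choice of base classifiers, their hyperparameters, the equal weights and the 10-fold cross-validation are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Same subset as above: rows 50 onward (classes 1 and 2),
# columns 1 and 2 (sepal width and petal length)
iris = load_iris()
X = iris.data[50:, [1, 2]]
y = iris.target[50:]

# Assumed base classifiers; any estimators with predict_proba would work
# for soft (probability-based) voting
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('dt', DecisionTreeClassifier(max_depth=1, random_state=0)),
                ('knn', KNeighborsClassifier(n_neighbors=1))],
    voting='soft',
    weights=[1, 1, 1],
)

# Mean accuracy across 10 stratified folds
scores = cross_val_score(voter, X, y, cv=10, scoring='accuracy')
print('Accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```

Setting `voting='hard'` would switch to majority voting on predicted class labels, which corresponds to the 'classlabel' option of the homegrown class.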
