# Empirical risk minimisation

In [2]:
import numpy as np
import numpy.random as random
import pandas as pd

from pandas import Series
from pandas import DataFrame
from typing import List

## I.  Majority voting algorithm

* Our implementation follows the `sklearn` prediction API:
  * constructor for fixing free hyperparameters
  * method `fit(samples, targets)` to train the model
  * method `predict(samples)` to predict labels
  * method `set_params(...)` to set hyperparameters  

In [3]:
class MajorityVoting:
    
    def __init__(self, features:List[str]=None):
        if features:
            self.features = list(features)
        else:
            self.features = None
    
    def set_params(self, features: List[str]) -> None:
        self.features = features
    
    def fit(self, X: DataFrame, y: Series) -> None:
        
        if self.features is None:
            self.features = list(X.columns.values)

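        # Attach the target and count positive labels per feature combination.
        # Selecting 'y' keeps unused columns of X out of the aggregation.
        data = X.assign(y = y)
        pred = data.groupby(self.features)['y'].aggregate(['count', 'sum'])
        # Predict True when at least half of the labels in the group are positive.
        self.pred = DataFrame({'prediction':(pred['sum']/pred['count'] >= 0.5)})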
    
    def predict(self, X: DataFrame) -> np.ndarray:
        # Feature combinations not seen during training have no rule and default to True.
        return (X[self.features]
                .join(self.pred, on=self.features, how='left')['prediction']
                .fillna(True)
                .values)
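
The rule table that `fit` builds and the lookup that `predict` performs can be illustrated with plain `pandas` calls. The snippet below is a minimal sketch on a hypothetical toy data set (the column names `x1`, `x2` and `target` are assumptions made for illustration), not a replacement for the class above.

```python
import pandas as pd

# Hypothetical toy data: two binary features and a binary target.
X = pd.DataFrame({'x1': [0, 0, 0, 1, 1], 'x2': [0, 0, 0, 1, 1]})
y = pd.Series([1, 1, 0, 0, 0])

# One rule per observed feature combination: predict True when at least
# half of the training labels in that group are 1.
rules = (X.assign(target=y)
          .groupby(['x1', 'x2'])['target']
          .mean()
          .ge(0.5)
          .rename('prediction'))

# Look up predictions for new samples; the combination (1, 0) was never seen.
X_new = pd.DataFrame({'x1': [0, 1, 1], 'x2': [0, 1, 0]})
joined = X_new.join(rules, on=['x1', 'x2'], how='left')

# Samples without a matching rule fall back to True, as in `predict`.
seen = joined['prediction'].notna()
pred = joined['prediction'].where(seen, True).astype(bool).to_numpy()
print(rules)
print(pred)
```

The left join keeps every row of `X_new`, so combinations without a rule show up as missing values that have to be filled with a default.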

# Homework

## 2.1 Classifier that minimises empirical risk (<font color='red'>1p</font>)

Given enough information about future data samples, it is possible to find a classifier with optimal accuracy.
* Extend the `MajorityVoting` algorithm to multi-class classification and apply it to the data frame `data` below.
* Predict `z` from `x` and `y` and show the corresponding table of rules.
* What is the corresponding risk if it is defined as the probability of misclassification on `data`?
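
The notion of risk in the last bullet can be illustrated on a separate toy data set. The sketch below is one possible approach rather than the expected solution: it assumes hypothetical columns `a`, `b` and `label`, picks the most frequent label in each group, and measures the fraction of misclassified rows.

```python
import pandas as pd

# Hypothetical toy data (not the homework `data`): the label has three classes.
toy = pd.DataFrame({
    'a':     [0, 0, 0, 1, 1, 1, 1],
    'b':     [0, 0, 0, 1, 1, 1, 1],
    'label': [2, 2, 0, 1, 1, 1, 0],
})

def most_frequent(labels: pd.Series):
    # Most common label in the group; ties would go to the first label listed.
    return labels.value_counts().idxmax()

# Rule table: one predicted label per observed feature combination.
rules = toy.groupby(['a', 'b'])['label'].agg(most_frequent).rename('prediction')

# Empirical risk: fraction of rows whose rule disagrees with the true label.
pred = toy.join(rules, on=['a', 'b'])['prediction']
risk = (pred != toy['label']).mean()
print(rules)
print(risk)  # 2 errors out of 7 rows
```

Choosing the most frequent label in each group minimises the number of misclassified rows within that group, and therefore the empirical risk over the whole table.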

In [5]:
data = (DataFrame([(0, 0, 0), (0, 0, 1), (0, 0, 1), (0, 1, 2), (0, 1, 2),
                  (1, 0, 1), (1, 0, 0), (1, 0, 2), (2, 0, 1), 
                  (2, 1, 0), (2, 1, 0), (2, 1, 0), 
                  (3, 1, 1), (3, 1, 1), (3, 1, 1), (3, 1, 2)], columns = ['x', 'y', 'z'])
        .sample(frac=1).reset_index(drop = True))
display(data.head())

|   | x | y | z |
|---|---|---|---|
| 0 | 1 | 0 | 2 |
| 1 | 1 | 0 | 0 |
| 2 | 3 | 1 | 1 |
| 3 | 3 | 1 | 1 |
| 4 | 3 | 1 | 2 |


## 2.2 Theoretical analysis of majority voting$^*$ (<font color='red'>3p</font>)

Explain why the training accuracy is so high for majority voting. 
You can give a theoretical answer or design an experiment to answer the following questions. 
You can consider the extreme case where the features $x_i\in\{0,1\}$ and labels $y\in\{0,1\}$ are sampled randomly. 

* Give a rough estimate of how many samples are needed before the training error is roughly the same as the test error.
* How does the required sample size depend on the number of dimensions?
* What changes if some feature values are more probable than others?
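
One way to set up the experiment for the extreme case is sketched below. It assumes uniformly random binary features and labels that are independent of the features, so no classifier can beat 50% test accuracy in expectation and anything above that on the training set is memorisation. The function name `train_test_accuracy` and the chosen sizes are illustrative assumptions, and the majority-vote table is re-implemented with `numpy` for speed.

```python
import numpy as np

def train_test_accuracy(n: int, d: int, rng: np.random.Generator,
                        n_test: int = 10_000):
    """Fit a majority-vote table on n random samples with d binary features."""
    weights = 1 << np.arange(d)  # encode each feature row as an integer cell id
    cells = rng.integers(0, 2, size=(n, d)) @ weights
    labels = rng.integers(0, 2, size=n)

    # Per cell: number of positive labels and total number of samples.
    ones = np.bincount(cells, weights=labels, minlength=2 ** d)
    total = np.bincount(cells, minlength=2 ** d)
    # Predict 1 when at least half of the labels are 1; empty cells give
    # 0 >= 0 and therefore default to 1, mirroring fillna(True).
    table = 2 * ones >= total

    test_cells = rng.integers(0, 2, size=(n_test, d)) @ weights
    test_labels = rng.integers(0, 2, size=n_test)
    train_acc = (table[cells] == labels).mean()
    test_acc = (table[test_cells] == test_labels).mean()
    return train_acc, test_acc

rng = np.random.default_rng(0)
d = 8  # 2 ** 8 = 256 possible feature combinations
for n in (10, 100, 1_000, 10_000, 100_000):
    train_acc, test_acc = train_test_accuracy(n, d, rng)
    print(f"n={n:>6}  train={train_acc:.3f}  test={test_acc:.3f}")
```

Varying `d` for a fixed `n`, or replacing the uniform feature sampling with a skewed distribution, extends the same setup to the other two questions.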