<a href="https://colab.research.google.com/github/rodsei/pattern-recognition/blob/main/Reconhecimento_de_padr%C3%B5es_Ensembles_Tarefa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Activity - Ensembles

In a Jupyter Notebook:

- Use a classification dataset with at least 1000 samples;
- Choose at least three classification algorithms;
- Combine the classifiers in two different ways:
  - Voting
  - Stacking
- Use grid search (or random search) to tune both the weak classifiers and the ensemble;
- Put the Jupyter Notebook on your GitHub.

The deliverable for this task is a link to the Jupyter Notebook on your GitHub.

# Background

## Types of Ensemble

Voting
  - Voting types:
    - Hard
    - Soft
  - Algorithm types:
    - Bagging
    - Boosting
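
To make the difference between hard and soft voting concrete, here is a minimal sketch. The probability values are made up for illustration and do not come from any classifier in this notebook. Hard voting counts one vote per classifier, while soft voting averages the predicted probabilities, so a single confident classifier can overturn a narrow majority.

```python
import numpy as np

# Hypothetical class-probability outputs of three classifiers for one sample
# (columns: P(class 0), P(class 1)); the numbers are illustrative only.
probas = np.array([
    [0.45, 0.55],  # classifier A leans slightly towards class 1
    [0.40, 0.60],  # classifier B leans slightly towards class 1
    [0.90, 0.10],  # classifier C is confident in class 0
])

# Hard voting: each classifier casts one vote for its most likely class.
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=2).argmax()

# Soft voting: average the probabilities, then pick the largest mean.
mean_proba = probas.mean(axis=0)
soft_winner = mean_proba.argmax()

print(hard_winner, soft_winner)  # prints: 1 0
```

In this toy case the two rules disagree: two of three classifiers vote for class 1, but the averaged probability favours class 0.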

Stacking

## How to Achieve Diversity

- Different algorithms
- Algorithms with different hyperparameters
- Different training data
  - Different samples
  - Different features
  - Different samples and features
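
The three ways of varying the training data listed above can be sketched with plain NumPy indexing. This is an illustrative sketch on random toy data; the array `X` and the subset sizes below are assumptions and are unrelated to the dataset used later in the notebook.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 20))  # toy data: 100 samples, 20 features
n_samples, n_features = X.shape

# Different samples: a bootstrap draw (rows sampled with replacement), as in bagging.
row_idx = rng.integers(0, n_samples, size=n_samples)

# Different features: a random subset of columns, drawn without replacement.
col_idx = rng.choice(n_features, size=5, replace=False)

# Different samples and features: apply both selections at once.
X_view = X[np.ix_(row_idx, col_idx)]
print(X_view.shape)  # (100, 5)
```

Each base learner in an ensemble would be trained on its own `X_view`, so the learners see different slices of the data and tend to make different errors.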


# Running the Activity

## Download Database

In [18]:
from sklearn.datasets import fetch_lfw_pairs
dataset = fetch_lfw_pairs()
X, y = dataset.data, dataset.target
X.shape, y.shape

((2200, 5828), (2200,))

In [19]:
set(y), y.dtype

({0, 1}, dtype('int64'))

## Split Train & Test

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((1650, 5828), (550, 5828), (1650,), (550,))

## Classifiers

In [22]:
from sklearn.model_selection import GridSearchCV

### KNN

In [29]:
from sklearn.neighbors import KNeighborsClassifier

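# Baseline KNN with default hyperparameters
model_knn = KNeighborsClassifier()
model_knn.fit(X_train, y_train)

knn_predict = model_knn.predict(X_test)

# Test-set accuracy: fraction of predictions that match the true labels
knn_hits = knn_predict == y_test
sum(knn_hits)/len(knn_hits)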

0.6036363636363636
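
The hit-rate expression used in these cells is plain accuracy. As a small sketch on made-up labels (the arrays `y_true` and `y_pred` below are illustrative and are not taken from the LFW data), `sklearn.metrics.accuracy_score` should give the same number:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Illustrative labels only: 4 of the 5 predictions are correct.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Manual hit rate, written the same way as in the notebook cells.
hits = y_pred == y_true
manual = sum(hits) / len(hits)

# Library equivalent.
library = accuracy_score(y_true, y_pred)

print(manual, library)
```

For a fitted classifier, `model.score(X_test, y_test)` returns the same mean accuracy without building the intermediate array.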

### Random Forest

In [28]:
from sklearn.ensemble import RandomForestClassifier

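# Baseline random forest with default hyperparameters.
# No random_state is set, so the score below can vary between runs.
model_rf = RandomForestClassifier()
model_rf.fit(X_train, y_train)

rf_predict = model_rf.predict(X_test)

# Test-set accuracy
rf_hits = rf_predict == y_test
sum(rf_hits)/len(rf_hits)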

0.6436363636363637

### Decision Tree

In [30]:
from sklearn.tree import DecisionTreeClassifier

# Baseline decision tree with default hyperparameters
model_tree = DecisionTreeClassifier()
model_tree.fit(X_train, y_train)

tree_predict = model_tree.predict(X_test)

# Test-set accuracy
tree_hits = tree_predict == y_test
sum(tree_hits)/len(tree_hits)


0.5563636363636364

### Gaussian Naive Bayes

In [35]:
from sklearn.naive_bayes import GaussianNB

# print(GaussianNB().get_params().keys())

gnb_params = {
    'priors': [None],
    'var_smoothing': [1e-09, 1e-07, 1e-05, 1e-03, 1e-01]
}

model_gnb = GridSearchCV(GaussianNB(), param_grid=gnb_params, cv=10, 
                      scoring="accuracy", n_jobs= 30, verbose = 1)

model_gnb.fit(X_train, y_train)

# best_estimator_ is the model refit on all of X_train with the best parameters
gnb_predict = model_gnb.best_estimator_.predict(X_test)

# Test-set accuracy
gnb_hits = gnb_predict == y_test
sum(gnb_hits)/len(gnb_hits)

Fitting 10 folds for each of 5 candidates, totalling 50 fits


[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done  42 out of  50 | elapsed:    5.0s remaining:    1.0s
[Parallel(n_jobs=30)]: Done  50 out of  50 | elapsed:    5.3s finished


0.5272727272727272
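
After a grid search it is often useful to check which candidate won and how close the others were. The sketch below shows one way to do that. It runs on a small synthetic dataset from `make_classification`, and the names `X_demo`, `y_demo` and `search` are illustrative, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Small synthetic problem so the search finishes quickly (assumption: not the LFW data).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=7)

search = GridSearchCV(
    GaussianNB(),
    param_grid={"var_smoothing": [1e-9, 1e-5, 1e-1]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# Best hyperparameters and their mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))

# Mean cross-validated accuracy of every candidate in the grid.
for params, mean in zip(search.cv_results_["params"],
                        search.cv_results_["mean_test_score"]):
    print(params, round(mean, 3))
```

Note that `best_score_` is a cross-validation score on the training data, so it is not directly comparable to the test-set accuracy printed by the cells in this notebook.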

### Perceptron

In [34]:
from sklearn.linear_model import Perceptron
print(Perceptron().get_params().keys())  # list the tunable hyperparameters

per_params = {
    'penalty': ['l2','l1','elasticnet',None],
    'shuffle': [True, False]
}

model_per = GridSearchCV(Perceptron(), param_grid=per_params, cv=3, 
                      scoring="accuracy", n_jobs= 30, verbose=1)

model_per.fit(X_train, y_train)

# GridSearchCV.predict delegates to the refit best estimator
per_predict = model_per.predict(X_test)
# Test-set accuracy
per_hits = per_predict == y_test
sum(per_hits)/len(per_hits)

dict_keys(['alpha', 'class_weight', 'early_stopping', 'eta0', 'fit_intercept', 'max_iter', 'n_iter_no_change', 'n_jobs', 'penalty', 'random_state', 'shuffle', 'tol', 'validation_fraction', 'verbose', 'warm_start'])
Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=30)]: Using backend LokyBackend with 30 concurrent workers.
[Parallel(n_jobs=30)]: Done  15 out of  24 | elapsed:   26.2s remaining:   15.7s
[Parallel(n_jobs=30)]: Done  24 out of  24 | elapsed:   32.5s finished


0.5527272727272727

## Combining the Classifiers

### Voting

In [39]:
from sklearn.ensemble import VotingClassifier

voting_params = {
    'knn__n_neighbors': [3, 5],
    'random_forest__n_estimators': [10,20],
    'random_forest__max_features': [0.15, 0.3],
    'random_forest__random_state': [7],
    'decision_tree__splitter': ['random'],
    'decision_tree__max_features': [0.15, 0.3],
    'decision_tree__random_state': [7]
}

model_voting = VotingClassifier([
    ('knn', KNeighborsClassifier()),
    ('random_forest', RandomForestClassifier()),
    ('decision_tree', DecisionTreeClassifier())
],n_jobs=35)

model_voting = GridSearchCV(model_voting, param_grid=voting_params, cv=3,
                      scoring="accuracy", n_jobs= 35, verbose=1)

model_voting.fit(X_train, y_train)

voting_predict = model_voting.predict(X_test)

# Test-set accuracy of the best voting ensemble found by the grid search
voting_hits = voting_predict == y_test
sum(voting_hits)/len(voting_hits)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=35)]: Using backend LokyBackend with 35 concurrent workers.
[Parallel(n_jobs=35)]: Done  28 out of  48 | elapsed: 10.7min remaining:  7.6min
[Parallel(n_jobs=35)]: Done  48 out of  48 | elapsed: 13.6min finished


0.6036363636363636
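
The keys in `voting_params` follow scikit-learn's `<estimator name>__<parameter>` convention for nested estimators. The sketch below, which uses a smaller two-member ensemble chosen only for brevity, shows how to check which keys are valid. It also shows that the ensemble's own `voting` parameter is addressed without a prefix, so it could be added to the grid as well.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Two-member ensemble, used here only to inspect parameter names.
ensemble = VotingClassifier([
    ("knn", KNeighborsClassifier()),
    ("decision_tree", DecisionTreeClassifier()),
])

# Nested parameters are exposed as "<name>__<parameter>".
param_names = ensemble.get_params().keys()
print("knn__n_neighbors" in param_names)
print("voting" in param_names)

# The same names work with set_params, which is what GridSearchCV uses internally.
ensemble.set_params(knn__n_neighbors=3, voting="soft")
print(ensemble.get_params()["knn__n_neighbors"])
```

Soft voting requires every member to implement `predict_proba`, which holds for the three classifiers combined above.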

### Stacking

In [36]:
from sklearn.ensemble import StackingClassifier


stack_params = {
    'passthrough': [True, False],
    'knn__n_neighbors': [3, 5],
    # 'knn__n_jobs': [4],
    'random_forest__n_estimators': [10,20],
    'random_forest__max_features': [0.15, 0.3],
    # 'random_forest__n_jobs': [4],
    'random_forest__random_state': [7],
    'decision_tree__splitter': ['random'],
    'decision_tree__max_features': [0.15, 0.3],
    'decision_tree__random_state': [7]
}

model_stack = StackingClassifier([
    ('knn', KNeighborsClassifier()),
    ('random_forest', RandomForestClassifier()),
    ('decision_tree', DecisionTreeClassifier())
], verbose= 1)



model_stack = GridSearchCV(model_stack, param_grid=stack_params, cv=3, 
                      scoring="accuracy", n_jobs= 35, verbose=1)

model_stack.fit(X_train, y_train) 

stack_predict = model_stack.predict(X_test)

# Test-set accuracy of the best stacking ensemble found by the grid search
stack_hits = stack_predict == y_test
sum(stack_hits)/len(stack_hits)

Fitting 3 folds for each of 32 candidates, totalling 96 fits


[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done  18 tasks      | elapsed: 24.2min
[Parallel(n_jobs=16)]: Done  96 out of  96 | elapsed: 102.3min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   34.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  1.3min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.5s finished


0.6090909090909091
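
To show what stacking does internally, here is a minimal sketch on synthetic data. The dataset, the two-member ensemble and the names `X_demo`, `stack` and `meta_features` are illustrative assumptions. The final estimator is written out as `LogisticRegression`, which is the scikit-learn default that the cell above relies on implicitly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small synthetic problem (assumption: not the LFW data) so this runs in seconds.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=7)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("decision_tree", DecisionTreeClassifier(random_state=7)),
    ],
    # Meta-learner trained on the base learners' cross-validated predictions.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    # With passthrough=True the original features are appended to the meta-features.
    passthrough=False,
)
stack.fit(X_tr, y_tr)

# transform() returns the meta-features that the final estimator receives.
meta_features = stack.transform(X_te)
print(meta_features.shape)
print(stack.score(X_te, y_te))
```

Because each candidate refits every base learner across internal folds, stacking inside a grid search multiplies the number of fits, which is consistent with the long run time logged above.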