# Trabalho prático AC

## Trabalho realizado por:

Margarida Vila-chã, 20210504

Miguel Duarte Silva, 20210504

Sebastião Lessa, 202103238


## Algoritmo escolhido e breve explicação:

O algoritmo escolhido pelo nosso grupo foi o Random Forrest.

Este algoritmo é um algoritmo de aprendizagem computacional supervisionado. Ele usa uma combinação de árvores de decisão aleatórias. O processo começa com a criação de um grande número de árvores (floresta), em que cada uma das árvores criadas é treinada com uma amostra diferente. Isto permite a criação de árvores suficientemente diferentes, o que leva a que elas juntas consigam lidar com uma ampla variedade de situações, reduzindo o risco de overffitting.

Baseamos-nos nos seguintes data-sets:

- [credit-g](https://www.openml.org/search?type=run&id=591&collections.id=99&run_task.task_id=31&sort=date)
- [breast-w](https://www.openml.org/search?type=run&id=7413&collections.id=99&run_task.task_id=15)
- [qsar-biodeg](https://www.openml.org/search?type=run&id=519587&collections.id=99&run_task.task_id=9957)

## Alteração escolhida
A alteração que o nosso grupo decidiu fazer no codigo fonte obtido foi a de considerar a utilização de outros tipos de critérios de divisão. O critério pré definido no nosso código fonte é o de entropia, por isso vamos testar com o gini e com o classification error.

## Import das bibliotecas necessárias:

In [61]:
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from random import random

from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

## Modelo implementado, sem alteração

In [62]:
class DecisionTree:
    def __init__(self, criterion, max_features=None):
        self.criterion = criterion
        self.max_features = max_features
        self.tree = None

    def fit(self, X, y):
        self.tree = self._build_tree(X, y)

    def _build_tree(self, X, y):
        if len(np.unique(y)) == 1:
            return {'class': y[0]}

        if X.shape[1] <= self.max_features:
            features = range(X.shape[1])
        else:
            features = np.random.choice(X.shape[1], self.max_features, replace=False)

        best_split_feature = None
        best_split_value = None
        best_gain = -np.inf

        for feature in features:
            if np.issubdtype(X[:, feature].dtype, np.number):
                unique_values = np.unique(X[:, feature])
                for value in unique_values:
                    left_indices = X[:, feature] <= value
                    right_indices = X[:, feature] > value
                    left_y = y[left_indices]
                    right_y = y[right_indices]
                    gain = self._compute_gain(y, left_y, right_y)
                    if gain > best_gain:
                        best_split_feature = feature
                        best_split_value = value
                        best_gain = gain
            else:
                unique_values = np.unique(X[:, feature])
                for value in unique_values:
                    indices = X[:, feature] == value
                    subset_y = y[indices]
                    gain = self._compute_gain(y, subset_y, subset_y)
                    if gain > best_gain:
                        best_split_feature = feature
                        best_split_value = value
                        best_gain = gain

        if best_gain == 0:
            return {'class': self._get_most_common_class(y)}

        if np.issubdtype(X[:, best_split_feature].dtype, np.number):
            left_indices = X[:, best_split_feature] <= best_split_value
            right_indices = X[:, best_split_feature] > best_split_value
            left_X, left_y = X[left_indices], y[left_indices]
            right_X, right_y = X[right_indices], y[right_indices]
        else:
            left_indices = X[:, best_split_feature] == best_split_value
            right_indices = X[:, best_split_feature] != best_split_value
            left_X, left_y = X[left_indices], y[left_indices]
            right_X, right_y = X[right_indices], y[right_indices]

        node = {
            'feature': best_split_feature,
            'value': best_split_value,
            'left': self._build_tree(left_X, left_y),
            'right': self._build_tree(right_X, right_y)
        }
        return node

    def _compute_gain(self, parent_y, left_y, right_y):
        parent_score = self._compute_score(parent_y)
        left_score = self._compute_score(left_y)
        right_score = self._compute_score(right_y)
        n = len(parent_y)
        left_ratio = len(left_y) / n
        right_ratio = len(right_y) / n
        gain = parent_score - (left_ratio * left_score + right_ratio * right_score)
        return gain

    def _compute_score(self, y):
        if self.criterion == "gini":
            return self._compute_gini(y)
        elif self.criterion == "entropy":
            return self._compute_entropy(y)
        elif self.criterion == "log_loss":
            return self._compute_log_loss(y)
        else:
            raise ValueError("Invalid criterion specified.")

    def _compute_gini(self, y):
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        gini = 1 - np.sum(probabilities ** 2)
        return gini

    def _compute_entropy(self, y):
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        entropy = -np.sum(probabilities * np.log2(probabilities))
        return entropy

    def _compute_log_loss(self, y):
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        log_loss = -np.sum(probabilities * np.log2(probabilities + 1e-10))
        return log_loss

    def _get_most_common_class(self, y):
        _, counts = np.unique(y, return_counts=True)
        most_common_index = np.argmax(counts)
        return y[most_common_index]

    def predict(self, X):
        return np.array([self._traverse_tree(x, self.tree) for x in X])

    def _traverse_tree(self, x, node):
        if 'class' in node:
            return node['class']

        feature = node['feature']
        value = node['value']
        # if np.issubdtype(x[features].dtype, np.number): #FIXME: isto foi o que o chatgpt deu, eu tirei '[features]' e funcionou mas esta demorado e lento e nao dá um score bom

        if np.issubdtype(x.dtype, np.number):
            if x[feature] <= value:
                return self._traverse_tree(x, node['left'])
            else:
                return self._traverse_tree(x, node['right'])
        else:
            if x[feature] == value:
                return self._traverse_tree(x, node['left'])
            else:
                return self._traverse_tree(x, node['right'])


In [63]:
def best_criterion(X, y, criteria):
    best_score = 0.0
    crit = None
    for criterion in criteria:
        X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=42)
        tree = DecisionTree(criterion=criterion)
        tree.fit(X_train, y_train)
        y_pred = tree.predict(X_test)
        score = accuracy_score(y, y_pred)
        if score > best_score:
            best_score = score
            crit=criterion
    return crit

In [193]:
class Custom_RF:
    def __init__(self, n_estimators=100, max_features=None):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.estimators = []

    def fit(self, X, y):
        self.estimators = []
        num_features = X.shape[1]
        self.max_features = self._get_max_features(num_features)

        criteria = ["gini", "entropy","log_loss"]

        for _ in range(self.n_estimators):
            tree = DecisionTree(best_criterion(X, y, criteria), self.max_features)
            tree.fit(X, y)
            self.estimators.append(tree)

    def _get_best_criterion(self, X, y, criteria):
        best_criterion = None
        best_score = -np.inf

        for criterion in criteria:
            tree = DecisionTree(criterion, self.max_features)
            score = tree.compute_score(y)
            if score > best_score:
                best_criterion = criterion
                best_score = score

        return best_criterion

    def predict(self, X):
        predictions = np.array([tree.predict(X) for tree in self.estimators])
        return np.array([Counter(pred).most_common(1)[0][0] for pred in predictions.T])

    def _get_max_features(self, num_features):
        if self.max_features is None:
            return num_features
        elif isinstance(self.max_features, int):
            return min(num_features, self.max_features)
        elif isinstance(self.max_features, float):
            return int(num_features * self.max_features)
        else:
            raise ValueError("Invalid value for max_features. Must be int or float.")


## Testes para cada dataset

In [194]:
def test_model(dataset, model):

    X=dataset.iloc[:,1:-1].values
    y=dataset.iloc[:,-1].values

    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=42)

    model.fit(X_train, y_train)

    print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

### DATASET 1 - Cancro da mama

In [66]:
df = pd.read_csv('breast.csv')
df.head()

Unnamed: 0,id,Clump_Thickness,Cell_Size_Uniformity,Cell_Shape_Uniformity,Marginal_Adhesion,Single_Epi_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1,5,1,1,1,2,1,3,1,1,0
1,2,5,4,4,5,7,10,3,2,1,0
2,3,3,1,1,1,2,2,3,1,1,0
3,4,6,8,8,1,3,4,3,7,1,0
4,5,4,1,1,3,2,1,3,1,1,0


In [196]:
df['Class']=df['Class'].map({'benign': 0, 'malignant': 1})
df['Bare_Nuclei'].replace('?', np.nan, inplace=True)
df.dropna(subset=['Bare_Nuclei'], inplace=True)
df.head()

Unnamed: 0,id,Clump_Thickness,Cell_Size_Uniformity,Cell_Shape_Uniformity,Marginal_Adhesion,Single_Epi_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Class
0,1,5,1,1,1,2,1,3,1,1,0
1,2,5,4,4,5,7,10,3,2,1,0
2,3,3,1,1,1,2,2,3,1,1,0
3,4,6,8,8,1,3,4,3,7,1,0
4,5,4,1,1,3,2,1,3,1,1,0


#### Teste 1 - Random Forest (sklearn) com o critério default (gini)

In [197]:
test_model(df, RF())

Accuracy: 0.9658536585365853


#### Teste 2 - Random Forest (sklearn) com o critério entropy

In [198]:
test_model(df, RF(criterion='entropy'))

Accuracy: 0.9560975609756097


#### Teste 3 - Random Forest (sklearn) com o critério log_loss

In [199]:
test_model(df, RF(criterion='log_loss'))

Accuracy: 0.9560975609756097


#### Teste 4 - Random Forest (Custom)

In [200]:
test_model(df, Custom_RF())

TypeError: '<=' not supported between instances of 'int' and 'NoneType'

## Dataset 2 - Biodeg

In [None]:
df = pd.read_csv('biodeg.csv')
df.head()

#### Teste 1 - Random Forest (sklearn) com o critério default (gini)


In [None]:
test_model(df, RF())


#### Teste 2 - Random Forest (sklearn) com o critério entropy

In [None]:
test_model(df, RF(criterion='entropy'))

#### Teste 3 - Random Forest (sklearn) com o critério log_loss

In [None]:
test_model(df, RF(criterion='log_loss'))

#### Teste 4 - Random Forest (Custom)

In [None]:
test_model(df, Custom_RF())

## Dataset 2 - Diabetes

In [None]:
df = pd.read_csv('diabetes.csv')
df.head()

In [None]:
df["'class'"]=df["'class'"].map({'tested_positive': 0, 'tested_negative': 1})
df.head()

#### Teste 1 - Random Forest (sklearn) com o critério default (gini)


In [None]:
test_model(df, RF())

#### Teste 2 - Random Forest (sklearn) com o critério entropy


In [None]:
test_model(df, RF(criterion='entropy'))


#### Teste 3 - Random Forest (sklearn) com o critério log_loss


In [None]:
test_model(df, RF(criterion='log_loss'))


#### Teste 4 - Random Forest (Custom)


In [None]:
test_model(df, Custom_RF())