**Universidade Federal de Minas Gerais**  
**Departamento de Ciência da Computação**  
**Machine Learning**  
**Valéria Pereira de Souza**    

## Practical Assignment 1 - Exoplanet Classification

# Problem statement

A KOI (Kepler Object of Interest) is a celestial body suspected of being an exoplanet. Based on the observation of several characteristics, a body is first flagged as a KOI; later, after further verification, each KOI is classified as "Confirmed" or "False positive". The task in this practical assignment is to automatically classify a KOI as either a confirmed exoplanet or a false positive.

To do so, we will train a series of classification algorithms on labeled data. Their performance will be measured by accuracy and then compared to identify the best result.

## 1. Data inspection, cleaning, and preparation

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import minmax_scale
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [2]:
INPUT_FILEPATH = "koi_data.csv"
TARGET = "koi_disposition"
N_FOLDS = 5

Let's load the data and look at a few instances.

In [3]:
data_raw = pd.read_csv(INPUT_FILEPATH)
data_raw.head(5)

Unnamed: 0,kepoi_name,koi_disposition,koi_period,koi_impact,koi_duration,koi_depth,koi_ror,koi_srho,koi_prad,koi_sma,...,koi_fwm_srao,koi_fwm_sdeco,koi_fwm_prao,koi_fwm_pdeco,koi_dicco_mra,koi_dicco_mdec,koi_dicco_msky,koi_dikco_mra,koi_dikco_mdec,koi_dikco_msky
0,K00752.01,CONFIRMED,9.48804,0.146,2.9575,615.8,0.02234,3.20796,2.26,0.0853,...,0.43,0.94,-0.0002,-0.00055,-0.01,0.2,0.2,0.08,0.31,0.32
1,K00752.02,CONFIRMED,54.41838,0.586,4.507,874.8,0.02795,3.02368,2.83,0.2734,...,-0.63,1.23,0.00066,-0.00105,0.39,0.0,0.39,0.49,0.12,0.5
2,K00754.01,FALSE POSITIVE,1.73695,1.276,2.40641,8079.2,0.38739,0.2208,33.46,0.0267,...,-0.111,0.002,0.00302,-0.00142,-0.249,0.147,0.289,-0.257,0.099,0.276
3,K00755.01,CONFIRMED,2.52559,0.701,1.6545,603.3,0.02406,1.98635,2.75,0.0374,...,-0.01,0.23,8e-05,-7e-05,0.03,-0.09,0.1,0.07,0.02,0.07
4,K00114.01,FALSE POSITIVE,7.36179,1.169,5.022,233.7,0.18339,0.00485,39.21,0.082,...,-13.45,24.09,0.00303,-0.00555,-4.506,7.71,8.93,-4.537,7.713,8.948


We observe 43 columns, one of which holds the code of each KOI and another the label of interest (y). The remaining columns appear to be numeric. Let's verify this.

In [4]:
data_raw.dtypes

kepoi_name           object
koi_disposition      object
koi_period          float64
koi_impact          float64
koi_duration        float64
koi_depth           float64
koi_ror             float64
koi_srho            float64
koi_prad            float64
koi_sma             float64
koi_incl            float64
koi_teq             float64
koi_insol           float64
koi_dor             float64
koi_max_sngle_ev    float64
koi_max_mult_ev     float64
koi_model_snr       float64
koi_steff           float64
koi_slogg           float64
koi_smet            float64
koi_srad            float64
koi_smass           float64
koi_kepmag          float64
koi_gmag            float64
koi_rmag            float64
koi_imag            float64
koi_zmag            float64
koi_jmag            float64
koi_hmag            float64
koi_kmag            float64
koi_fwm_stat_sig    float64
koi_fwm_sra         float64
koi_fwm_sdec        float64
koi_fwm_srao        float64
koi_fwm_sdeco       float64
koi_fwm_prao        

Indeed, all the remaining columns are numeric and of the same type, so no transformation of categorical features is needed. The kepoi_name column holds a unique identifier for each instance. It is therefore not a discriminative feature between the classes and will be dropped; we will also separate the label column from the data.  

y will hold the labels  
X will hold a DataFrame with all the original features

In [5]:
y = data_raw[TARGET]
X = data_raw.drop(columns=['kepoi_name', 'koi_disposition'])

Let's inspect the features and check which transformations are needed.

In [6]:
X.isnull().values.any()

False

In [7]:
X.describe()

Unnamed: 0,koi_period,koi_impact,koi_duration,koi_depth,koi_ror,koi_srho,koi_prad,koi_sma,koi_incl,koi_teq,...,koi_fwm_srao,koi_fwm_sdeco,koi_fwm_prao,koi_fwm_pdeco,koi_dicco_mra,koi_dicco_mdec,koi_dicco_msky,koi_dikco_mra,koi_dikco_mdec,koi_dikco_msky
count,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,...,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0
mean,37.032237,0.717106,5.607025,21340.318993,0.235205,3.41537,112.230798,0.158146,81.181413,1143.721069,...,-0.355681,-0.805629,-0.000263,0.000439,-0.049743,-0.087413,1.930251,-0.038402,-0.098738,1.920226
std,88.417985,2.628207,6.962634,66989.80855,2.586213,25.131368,3699.799318,0.241792,16.308839,775.788868,...,10.978677,14.741473,0.065707,0.077519,2.46567,2.746534,3.147553,2.465094,2.734732,3.142764
min,0.30694,0.0,0.1046,0.8,0.00129,4e-05,0.08,0.0072,2.29,92.0,...,-275.6,-397.62,-4.0,-0.8,-21.5,-75.9,0.0,-23.6,-76.6,0.0
25%,2.213962,0.226,2.50025,176.8,0.013058,0.176092,1.46,0.033,81.93,615.25,...,-0.5,-0.57,-0.00024,-0.00024,-0.27,-0.2915,0.12825,-0.26525,-0.32,0.18
50%,7.386755,0.61,3.8055,495.95,0.024185,0.748045,2.6,0.07365,87.89,948.0,...,0.0,-0.03,0.0,0.0,0.0,0.0,0.46,-0.007,-0.018,0.453
75%,23.448117,0.92375,6.00075,2120.525,0.17126,2.267063,21.645,0.1582,89.52,1482.0,...,0.5,0.45,0.00026,0.00028,0.23,0.23,2.57,0.22625,0.25,2.42
max,1071.23262,100.806,138.54,864260.0,99.87065,918.75239,200346.0,2.0345,90.0,9791.0,...,97.78,98.78,1.19,5.0,45.68,27.5,88.6,46.57,31.2,89.6


There are no missing values.
However, the features vary widely in scale. This can be a problem for algorithms that are sensitive to feature scale (for example distance-based ones such as kNN and SVM), since features with more extreme values would influence the result more, artificially biasing the models. To avoid this, we apply feature scaling using min-max normalization to the interval [0, 1].

In [8]:
# Feature normalization
X = pd.DataFrame(minmax_scale(X, feature_range=(0, 1)))
X.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,31,32,33,34,35,36,37,38,39,40
count,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,...,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0,5202.0
mean,0.034293,0.007114,0.039747,0.024691,0.002342,0.003717,0.00056,0.074457,0.899457,0.108436,...,0.737169,0.799384,0.770662,0.138007,0.319295,0.733197,0.021786,0.335779,0.709659,0.021431
std,0.082562,0.026072,0.050295,0.077511,0.025896,0.027354,0.018467,0.119268,0.18594,0.079986,...,0.029403,0.029697,0.01266,0.013365,0.036702,0.026562,0.035525,0.03513,0.025369,0.035075
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.001781,0.002242,0.017305,0.000204,0.000118,0.000192,7e-06,0.012726,0.907992,0.053949,...,0.736783,0.799859,0.770667,0.13789,0.316017,0.731223,0.001448,0.332546,0.707607,0.002009
50%,0.006611,0.006051,0.026734,0.000573,0.000229,0.000814,1.3e-05,0.032778,0.975943,0.088257,...,0.738122,0.800947,0.770713,0.137931,0.320036,0.734043,0.005192,0.336226,0.710408,0.005056
75%,0.021609,0.009164,0.042591,0.002453,0.001702,0.002468,0.000108,0.074483,0.994527,0.143314,...,0.739461,0.801914,0.770763,0.137979,0.323459,0.736267,0.029007,0.33955,0.712894,0.027009
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
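
As a small illustration of the min-max normalization applied above, the sketch below rescales a toy matrix by hand and compares it with `minmax_scale`. The toy values and the names `toy`, `manual`, and `scaled` are made up for this example and are not part of the KOI data.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Toy feature matrix (made-up values): two columns on very different scales.
toy = np.array([[1.0, 100.0],
                [3.0, 300.0],
                [5.0, 900.0]])

# Manual min-max scaling, per column: x' = (x - min) / (max - min).
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# scikit-learn's helper, called the same way as in the cell above.
scaled = minmax_scale(toy, feature_range=(0, 1))

print(manual)
print(np.allclose(manual, scaled))
```

After scaling, every column has minimum 0 and maximum 1, which matches the `min` and `max` rows of the `describe()` output above.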


Finally, we encode the labels.

In [9]:
label_encoder = LabelEncoder()
label_encoder.fit(['CONFIRMED', 'FALSE POSITIVE'])
y = label_encoder.transform(y)
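
It may help to check which integer each class receives, because `precision_score` and `recall_score` treat label 1 as the positive class by default. The sketch below uses made-up toy labels; the names `encoder`, `mapping`, `y_true`, and `y_pred` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score

encoder = LabelEncoder()
encoder.fit(['CONFIRMED', 'FALSE POSITIVE'])

# LabelEncoder sorts the classes, so CONFIRMED -> 0 and FALSE POSITIVE -> 1.
codes = encoder.transform(encoder.classes_)
mapping = {str(label): int(code) for label, code in zip(encoder.classes_, codes)}
print(mapping)

# Toy ground truth and predictions (made up for illustration).
y_true = encoder.transform(['CONFIRMED', 'FALSE POSITIVE', 'FALSE POSITIVE', 'CONFIRMED'])
y_pred = encoder.transform(['CONFIRMED', 'FALSE POSITIVE', 'CONFIRMED', 'CONFIRMED'])

# Default pos_label=1: precision is measured for the FALSE POSITIVE class.
precision_fp = precision_score(y_true, y_pred)

# Passing pos_label=0 measures precision for the CONFIRMED class instead.
precision_confirmed = precision_score(y_true, y_pred, pos_label=0)

print(precision_fp, precision_confirmed)
```

With this encoding, precision and recall computed with default arguments describe how well the FALSE POSITIVE class is detected.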

## 2. Experiments

#### *Utility class*

In [27]:
class Experiment():
    def __init__(self):

        self.X_train, self.X_test = [], []
        self.y_train, self.y_test = [], []
        self.train_index, self.test_index = [], []
        self.train_pred, self.test_pred = [], []
        
        self.accuracy_train, self.accuracy_test = [], []
        self.precision_train, self.precision_test = [], []
        self.recall_train, self.recall_test = [], []
        self.mean_train_accuracy = 0
        self.mean_test_accuracy = 0
        self.mean_train_precision = 0
        self.mean_test_precision = 0
        self.mean_train_recall = 0
        self.mean_test_recall = 0

    def run_decision_tree(self, depth):
        for self.train_index, self.test_index in self.get_folds():

            self.slice_fold()            

            clf = DecisionTreeClassifier(max_depth=depth)
            clf = clf.fit(self.X_train, self.y_train)

            self.get_predictions(clf)
            self.get_metrics_per_fold()
            
        self.get_mean_metrics()
        
    def run_naive_bayes(self):
        for self.train_index, self.test_index in self.get_folds():

            self.slice_fold()            

            clf = GaussianNB()
            clf = clf.fit(self.X_train, self.y_train)

            self.get_predictions(clf)
            self.get_metrics_per_fold()
            
        self.get_mean_metrics()
    
    def get_folds(self):
        # Keyword arguments are required by recent scikit-learn versions.
        kfold = KFold(n_splits=N_FOLDS, shuffle=True, random_state=1)
        return kfold.split(X)
    
    def slice_fold(self):
        global X
        global y
        self.X_train = X.iloc[self.train_index]
        self.y_train = y[self.train_index]
        
        self.X_test = X.iloc[self.test_index]
        self.y_test = y[self.test_index]
        
        
    def get_predictions(self, clf):
        self.train_pred = clf.predict(self.X_train)
        self.test_pred = clf.predict(self.X_test)
        
    def get_metrics_per_fold(self):
        self.accuracy_train.append(accuracy_score(self.y_train, self.train_pred))
        self.accuracy_test.append(accuracy_score(self.y_test, self.test_pred))
        
        self.precision_train.append(precision_score(self.y_train, self.train_pred))
        self.precision_test.append(precision_score(self.y_test, self.test_pred))
        
        self.recall_train.append(recall_score(self.y_train, self.train_pred))
        self.recall_test.append(recall_score(self.y_test, self.test_pred))
    
    def get_mean_metrics(self):
        self.mean_train_accuracy = np.mean(self.accuracy_train)
        self.mean_test_accuracy = np.mean(self.accuracy_test)

        self.mean_train_precision = np.mean(self.precision_train)
        self.mean_test_precision = np.mean(self.precision_test)

        self.mean_train_recall = np.mean(self.recall_train)
        self.mean_test_recall = np.mean(self.recall_test)
    
    def print_results(self):
        print("Train accuracy: ", self.mean_train_accuracy)
        print("Test accuracy:", self.mean_test_accuracy)
        print("Train precision: ", self.mean_train_precision)
        print("Test precision: ", self.mean_test_precision)
        print("Train recall: ", self.mean_train_recall)
        print("Test recall: ", self.mean_test_recall)
    
    def confusion_matrix(self):
        # Placeholder: not implemented yet.
        raise NotImplementedError


In [133]:
# decision_tree_test = Experiment()
# decision_tree_test.run_model(run_decision_tree)

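### 2.1 Baseline

As the baseline we use the Naive Bayes algorithm (here in its Gaussian variant), which estimates the conditional probability of each feature given the observed label and treats the features as conditionally independent given the class.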

In [28]:
baseline = Experiment()
baseline.run_naive_bayes()
baseline.print_results()

Train accuracy:  0.91685910714593
Test accuracy: 0.9175338062513856
Train precision:  0.9799221525989648
Test precision:  0.9795733517273257
Train recall:  0.8784121666027411
Test recall:  0.8799206286043619
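
The decision rule behind the baseline can be sketched by hand: for each class, add the log prior to the sum of per-feature Gaussian log-likelihoods and pick the class with the highest score. The toy data and the helper names `gaussian_log_pdf` and `predict_by_hand` are assumptions made for this illustration; `GaussianNB` additionally applies a small variance smoothing, so the scores (though not the prediction here) differ slightly.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (made up): two features, two well-separated classes.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian, applied element-wise."""
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mean) ** 2 / (2.0 * var)

def predict_by_hand(x):
    """Return the class with the highest log prior + summed log-likelihood."""
    classes = np.unique(y_toy)
    scores = []
    for c in classes:
        rows = X_toy[y_toy == c]
        log_prior = np.log(len(rows) / len(X_toy))
        # Naive assumption: features are independent given the class,
        # so the per-feature log-likelihoods simply add up.
        log_likelihood = gaussian_log_pdf(x, rows.mean(axis=0), rows.var(axis=0)).sum()
        scores.append(log_prior + log_likelihood)
    return int(classes[np.argmax(scores)])

query = np.array([1.1, 1.9])
by_hand = predict_by_hand(query)
by_sklearn = int(GaussianNB().fit(X_toy, y_toy).predict(query.reshape(1, -1))[0])
print(by_hand, by_sklearn)
```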


## 3. Models

### 3.1 Decision Trees

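A decision tree is a non-parametric classifier that recursively splits the data on the feature and threshold that most reduce the impurity of the resulting subsets. Impurity can be measured by entropy; note that `DecisionTreeClassifier` uses Gini impurity unless `criterion="entropy"` is passed.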
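
To make the splitting criterion concrete, the sketch below computes the entropy of a toy label set and the information gain of two candidate thresholds. The data and the helper names `entropy` and `information_gain` are made up for this illustration and are not taken from scikit-learn.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels, threshold):
    """Entropy reduction obtained by splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy feature and labels (made up): the classes separate cleanly around 0.5.
feature = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 1, 1])

gain_good = information_gain(feature, labels, 0.5)   # separates the classes
gain_poor = information_gain(feature, labels, 0.15)  # isolates a single sample

print(gain_good, gain_poor)
```

A tree built with the entropy criterion would prefer the first threshold, since it yields the larger gain.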


[Vary the maximum tree height (including allowing unlimited height) and show the results graphically]


###### 3.1.1 Experiments

In [94]:
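accuracy_train, precision_train, recall_train = [], [], []
accuracy_test, precision_test, recall_test = [], [], []

# Largest finite height to try; 30 matches the x-axis range of the plotting cell.
MAX_DEPTH = 30

# Vary the tree height; the final entry (None) allows unlimited height.
for depth in tqdm(list(range(1, MAX_DEPTH + 1)) + [None]):
    experiment = Experiment()
    experiment.run_decision_tree(depth)

    # Keep the cross-validated means for each height.
    accuracy_train.append(experiment.mean_train_accuracy)
    accuracy_test.append(experiment.mean_test_accuracy)
    precision_train.append(experiment.mean_train_precision)
    precision_test.append(experiment.mean_test_precision)
    recall_train.append(experiment.mean_train_recall)
    recall_test.append(experiment.mean_test_recall)

In [95]:
# Plot the results by accuracy; the last point is the unlimited-height tree.
x = list(range(1, len(accuracy_train) + 1))
plt.plot(x, accuracy_train, label="train")
plt.plot(x, accuracy_test, label="test")
plt.xlabel("Maximum tree height (last point: unlimited)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()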

### 3.2 SVM

### 3.3 kNN

### 3.4 Random Forest

### 3.5 Gradient Tree Boosting