# 1. Machine Learning Applied to Skin Cancer Prediction

## 1.1 Problem Introduction

The problem calls for supervised learning. It is a classification task.

Batch learning or online learning?

Several supervised algorithms will be tested. Performance will be measured with accuracy, specificity, and sensitivity (all derived from the confusion matrix).
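As a minimal sketch of how these three metrics can be read off a confusion matrix, the block below uses small hypothetical label arrays (`y_true` and `y_pred` are made up for illustration, not taken from the dataset). Because the diagnosis column may hold more than two classes, sensitivity and specificity are computed per class in a one-vs-rest fashion.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

tp = np.diag(cm)                # samples of each class predicted correctly
fn = cm.sum(axis=1) - tp        # samples of the class predicted as another class
fp = cm.sum(axis=0) - tp        # samples of other classes predicted as this class
tn = cm.sum() - (tp + fn + fp)  # everything else

sensitivity = tp / (tp + fn)    # true positive rate, per class
specificity = tn / (tn + fp)    # true negative rate, per class
accuracy = accuracy_score(y_true, y_pred)

print(cm)
print("sensitivity:", sensitivity)
print("specificity:", specificity)
print("accuracy:", accuracy)
```

For a binary problem the same arrays reduce to a single sensitivity and specificity pair for the positive class.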

## 1.2 Get Data



A description of these variables should be provided.

In [0]:
import pandas as pd

# read the dataset to a Pandas' dataframe
data = pd.read_csv("DadosGeral.txt", sep='\t')

# data shape
print(data.shape)
data.head()

In [0]:
data.info()

In [0]:
# Number of samples per diagnosis type

data.clinical_diagnosis.value_counts()

In [0]:
data.describe()

## 1.3 Clean, Prepare & Manipulate Data


The values in the `maximum_nights` and `number_of_reviews` columns span much larger ranges than the other columns. For example, within the first few rows alone, `maximum_nights` takes values as low as 27 and as high as 1125. If these two columns are used in a distance-based model such as k-nearest neighbors, they could have an outsized effect on the distance calculations simply because their values are large. To prevent any single column from dominating the distance, we can normalize all columns to have a mean of 0 and a standard deviation of 1.

Rescaling the values in each column to the scale of the [standard normal distribution](https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution) (mean of 0, standard deviation of 1) preserves the shape of each column's distribution while aligning the scales. To standardize a column, you need to:

- subtract the column mean from each value
- divide the result by the column standard deviation
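The two steps above can be sketched directly in pandas. The small `df` below is a made-up example, not the notebook's data. One detail worth noting: pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`), so `ddof=0` is passed here to make the two results comparable.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with columns on very different scales
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [10.0, 20.0, 30.0, 100.0]})

# Step 1: subtract the column mean; step 2: divide by the column std
manual = (df - df.mean()) / df.std(ddof=0)

# Same operation through scikit-learn
scaled = StandardScaler().fit_transform(df)

print(manual)
print(np.allclose(manual.values, scaled))
```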

In [0]:
data_aux = data.drop('clinical_diagnosis', axis=1)
data_aux.describe()

In [0]:
from sklearn.preprocessing import StandardScaler

# apply z-score (mean=0, std=1)
normalized = pd.DataFrame(StandardScaler().fit_transform(data_aux),
                            columns=data_aux.columns,
                            index=data_aux.index)
normalized.describe()

In [0]:
# Exploratory Data Analysis
# Identify the KDE shape of every column (is it close to Gaussian?) after standardization
import matplotlib.pyplot as plt
normalized.plot(kind='density',
                        layout=(10,2),
                        subplots=True,
                        figsize=(25,25),
                        sharex=False)
plt.show()

When numbers are used to represent different options or categories, they are referred to as **categorical values**. Classification estimates the relationship between the independent variables (x) and a dependent **categorical variable** (y).

The column containing the nevus classification must be encoded: it is a categorical variable, so it has to be converted to numeric values.

In [0]:
col = pd.Categorical(data["clinical_diagnosis"])
data_aux["clinical_diagnosis"] = col.codes
data_aux.head(200)
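When encoding this way it can help to keep the mapping from integer codes back to the original labels, so that predictions can be reported with readable names. The sketch below assumes hypothetical label values (`"nevus"` and `"melanoma"`); the actual categories in `clinical_diagnosis` may differ.

```python
import pandas as pd

# Hypothetical diagnosis labels, for illustration only
labels = pd.Series(["nevus", "melanoma", "nevus", "melanoma", "nevus"])

cat = pd.Categorical(labels)
codes = cat.codes                          # integer code for each row
mapping = dict(enumerate(cat.categories))  # code -> original label

print(list(codes))
print(mapping)
```

Categories are sorted before codes are assigned, so the code given to a label depends on alphabetical order, not on order of appearance.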

In [0]:
corr_matrix = data_aux.corr()
corr_matrix["clinical_diagnosis"].\
  sort_values(ascending=False)

In [0]:
data_aux.head()

Three benefits of performing feature selection before modeling your data are:

- **Reduces Overfitting**: Less redundant data means less opportunity to make decisions based on noise.
- **Improves Accuracy**: Less misleading data means modeling accuracy improves.
- **Reduces Training Time**: Less data means that algorithms train faster.
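One possible way to perform feature selection, shown here only as a sketch on synthetic data, is univariate selection with scikit-learn's `SelectKBest` and the ANOVA F-test (`f_classif`). The choice of `k=3` and the synthetic dataset are assumptions for illustration; the notebook itself does not specify a selection method.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 10 features, only some of them informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=2, random_state=42)

# Keep the 3 features with the highest F-score against the target
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

kept = selector.get_support(indices=True)  # indices of the retained columns
print("kept feature indices:", kept)
print("shape before:", X.shape, "after:", X_selected.shape)
```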

## 1.4 Train Model

**Load Libraries**

In [0]:
# Standard library
import time

# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn: data, models, pipeline
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# scikit-learn: model selection and metrics
from sklearn.model_selection import (GridSearchCV, KFold, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, make_scorer)

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier

# Only available when running inside Google Colab
from google.colab import files

**Split data into train and test**

In [0]:
# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(data_aux.drop('clinical_diagnosis', axis=1), 
                                                    data_aux.clinical_diagnosis,
                                                    test_size=0.20, 
                                                    random_state=42)

In [0]:
# global variables
seed = 42
num_folds = 10
scoring = {'Accuracy': make_scorer(accuracy_score)}

**Training using a Pipeline and Gridsearch**

In [0]:
# A single Pipeline
pipe = Pipeline(steps = [("clf",MLPClassifier())])

# create a dictionary with the hyperparameters
search_space = [
                {"clf":[MLPClassifier()],
                 "clf__hidden_layer_sizes": [(120,240),(120,480,120),(120,240,480,240,120)],
                 "clf__activation": ["logistic","relu"],
                 "clf__solver": ["sgd"],
                 "clf__max_iter": [50000],
                 "clf__early_stopping":[True],
                 "clf__n_iter_no_change":[20],
                 "clf__validation_fraction":[0.20], 
                 }
                ]

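# create the cross-validation splitter
# (shuffle=True is required for random_state to take effect; recent
# scikit-learn versions raise an error otherwise)
kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=seed)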

# return_train_score=True
# official documentation: "computing the scores on the training set can be
# computationally expensive and is not strictly required to
# select the parameters that yield the best generalization performance".
grid = GridSearchCV(estimator=pipe, 
                    param_grid=search_space,
                    cv=kfold,
                    scoring=scoring,
                    return_train_score=True,
                    n_jobs=-1,
                    refit="Accuracy")

tmp = time.time()

# fit grid search
best_model = grid.fit(X_train,y_train)

print("CPU Training Time: %s seconds" % (str(time.time() - tmp)))
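A possible variant, not used in the cell above, is to place the `StandardScaler` inside the pipeline so that the scaling statistics are learned only from the training part of each fold. The sketch below runs on synthetic data with a deliberately small network and grid so that it finishes quickly; the names `pipe_scaled` and `grid_scaled` and all hyperparameter values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the skin-lesion dataset)
X_demo, y_demo = make_classification(n_samples=150, n_features=8,
                                     random_state=42)

# The scaler is a pipeline step, so it is re-fit on the training part of every fold
pipe_scaled = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                          random_state=42)),
])

param_grid = {"clf__activation": ["logistic", "relu"]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

grid_scaled = GridSearchCV(pipe_scaled, param_grid=param_grid, cv=cv,
                           scoring="accuracy")
grid_scaled.fit(X_demo, y_demo)

print(grid_scaled.best_params_, grid_scaled.best_score_)
```

Keeping the scaler in the pipeline also means that `predict` on new data applies the same scaling automatically.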

In [0]:
print("Best: %f using %s" % (best_model.best_score_,best_model.best_params_))

In [0]:
result = pd.DataFrame(best_model.cv_results_)

In [0]:
result_acc = result[['mean_train_Accuracy', 'std_train_Accuracy',
                     'mean_test_Accuracy', 'std_test_Accuracy','rank_test_Accuracy',"param_clf__hidden_layer_sizes"]].copy()
result_acc["std_ratio"] = result_acc.std_test_Accuracy/result_acc.std_train_Accuracy
result_acc.sort_values(by="rank_test_Accuracy",ascending=True)

## 1.5 Test Data

**Holdout validation** is a specific case (k = 2) of a larger class of validation techniques called **k-fold cross-validation**. In holdout validation the data is split into two halves; a model is trained on each half and tested on the other. This is better than a single train/test split because the evaluation is not tied to one specific subset of the data, but each of the two models is trained on only half of the available data. K-fold cross-validation, on the other hand, uses a larger proportion of the data for training in each round while still rotating through different subsets, which avoids the issues of train/test validation.
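The rotation through subsets can be sketched with `cross_val_score`. The example below is an illustration on synthetic data with a logistic regression model and 5 folds, all of which are assumptions chosen to keep it short; it is not the evaluation used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# With 5 folds, each model trains on 80% of the data and is tested on the other 20%
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```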

In [0]:
# best model
predict_first = best_model.best_estimator_.predict(X_test)
print(accuracy_score(y_test, predict_first))