# Analyzing and Applying Machine Learning to the Breast Cancer Dataset

Contents:

1. [Data Preprocessing](#1)
1. [Applying XGBoost](#2)
1. [Using Optuna for Hyperparameter Tuning](#3)
1. [Conclusion](#4)


In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb
from xgboost import XGBClassifier, plot_importance
import sklearn
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import log_loss
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split, KFold, GroupKFold, StratifiedKFold
import optuna
from optuna import Trial, visualization
import warnings
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Mute warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.head()

<a id="1"></a> 
# Data Preprocessing

1 - ID number

2 - Diagnosis (M = malignant, B = benign)

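3-32 - Ten real-valued features are computed for each cell nucleus; the mean, standard error, and "worst" (largest) value of each feature are stored, giving 30 columns: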

* a) radius (mean of distances from center to points on the perimeter)
* b) texture (standard deviation of gray-scale values)
* c) perimeter
* d) area
* e) smoothness (local variation in radius lengths)
* f) compactness (perimeter^2 / area - 1.0)
* g) concavity (severity of concave portions of the contour)
* h) concave points (number of concave portions of the contour)
* i) symmetry
* j) fractal dimension ("coastline approximation" - 1)

According to the dataset provider, there are no missing values:

In [None]:
df.isnull().sum()

The `Unnamed: 32` column is not described in the dataset documentation, so I will drop it along with `id`.

In [None]:
df = df.drop(['id','Unnamed: 32'],axis=1)
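
One way to confirm that a column such as `Unnamed: 32` carries no information before dropping it is to look for columns that are entirely null. The sketch below runs on a small made-up frame, not the real dataset, and `errors='ignore'` is an optional safeguard so the drop does not fail if a column is already gone.

```python
import numpy as np
import pandas as pd

# Made-up frame that mimics the situation: one useful column, one empty column
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'radius_mean': [17.99, 20.57, 19.69],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing
empty_cols = toy.columns[toy.isnull().all()].tolist()

# Drop the identifier and the empty columns; ignore names that are absent
cleaned = toy.drop(columns=['id'] + empty_cols, errors='ignore')
remaining = cleaned.columns.tolist()

print(empty_cols)
print(remaining)
```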

Finally, let's encode the diagnosis column (`cat.codes` assigns codes in alphabetical order, so B = 0 and M = 1):

In [None]:
df['diagnosis'] = df.diagnosis.astype('category').cat.codes
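
As a quick check of what `cat.codes` produces, the sketch below applies the same encoding to a toy frame (the four labels are made up for illustration). Pandas sorts the categories, so each label's code is its position in the sorted category list.

```python
import pandas as pd

# Toy labels standing in for the real 'diagnosis' column (illustrative only)
toy = pd.DataFrame({'diagnosis': ['M', 'B', 'B', 'M']})

cat = toy['diagnosis'].astype('category')

# Categories are sorted, so the code of each label is its position in this list
mapping = {label: code for code, label in enumerate(cat.cat.categories)}
codes = cat.cat.codes.tolist()

print(mapping)  # {'B': 0, 'M': 1}
print(codes)    # [1, 0, 0, 1]
```

Keeping this mapping at hand makes it easier to label plots and reports correctly later on.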

Let's check the type of each column:

In [None]:
df.info()

Everything looks fine here, so we can move on to the analysis.

In [None]:
corrMatrix = df.corr()
fig = plt.figure(figsize=(20,20))
sns.heatmap(corrMatrix, annot=True)
plt.show()

Let's print the matrix to see the values:

In [None]:
df.corr()
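
A full 31 x 31 matrix is hard to scan, so it can help to look only at how each feature correlates with the target. The sketch below does this on a made-up frame (the values and the two feature columns are illustrative assumptions, not rows from the dataset) and orders the features by absolute correlation.

```python
import pandas as pd

# Made-up values for illustration; the real frame has 30 feature columns
toy = pd.DataFrame({
    'diagnosis':       [0, 0, 1, 1, 1],
    'radius_mean':     [11.0, 12.5, 18.0, 20.1, 19.4],
    'smoothness_mean': [0.10, 0.08, 0.09, 0.11, 0.10],
})

# Correlation of every feature with the target
corr = toy.corr()['diagnosis'].drop('diagnosis')

# Order by absolute value so strong negative correlations are not buried
order = corr.abs().sort_values(ascending=False).index
target_corr = corr.reindex(order)

print(target_corr)
```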

Let's look at the diagnosis column to see whether the dataset is balanced:

In [None]:
fig = plt.figure(figsize=(5,5))
sns.countplot(x=df.diagnosis,data=df)
plt.xlabel("Diagnosis: 0 - Benign, 1 - Malignant")
plt.ylabel("Count")

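Since `cat.codes` maps B to 0 and M to 1, the taller bar is the benign class: there are about 70% more benign samples than malignant ones. This imbalance could become an obstacle later, because the model may be biased toward the majority class, but let's move on!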
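
The imbalance can also be read off numerically instead of from the plot. The sketch below uses made-up labels (six of one class, four of the other) purely for illustration. The ratio of negative to positive samples is the value commonly passed to XGBoost's `scale_pos_weight` parameter for imbalanced binary problems; whether that is needed for this dataset is left open here.

```python
import pandas as pd

# Made-up labels for illustration: 0 = benign, 1 = malignant
labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], name='diagnosis')

# Absolute counts and relative shares per class
counts = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index()

# Negative-to-positive ratio, a common starting value for scale_pos_weight
ratio = counts.loc[0] / counts.loc[1]

print(counts.to_dict())  # {0: 6, 1: 4}
print(shares.to_dict())
print(ratio)             # 1.5
```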

<a id="2"></a> 
# Applying XGBoost

Let's start by defining our features:

In [None]:
feature_cols = df.columns.values.tolist()
target_col = ['diagnosis']
feature_cols.remove('diagnosis')

In [None]:
# Build the feature matrix and the target vector

X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Declare the XGBClassifier; the name xgb_model avoids shadowing
# the xgboost module that was imported above as xgb
xgb_model = XGBClassifier(objective='binary:logistic')
xgb_model.fit(X, y)

<a id="3"></a> 
# Using Optuna for Hyperparameter Tuning

In [None]:
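def objective(trial,data=X,target=y):

    train_x, test_x, train_y, test_y = train_test_split(data, target, test_size=0.15,random_state=42)

    # Search space sampled by Optuna; these ranges are illustrative assumptions
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 2, 8),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
    }

    # XGBoost 1.6+ takes early_stopping_rounds in the constructor rather than in fit()
    model = XGBClassifier(objective='binary:logistic', early_stopping_rounds=100, **params)
    model.fit(train_x,train_y,eval_set=[(test_x,test_y)],verbose=False)

    preds = model.predict(test_x)

    # RMSE of the hard 0/1 predictions (the square root of the error rate);
    # np.sqrt avoids the squared= argument, which newer scikit-learn versions removed
    rmse = np.sqrt(mean_squared_error(test_y, preds))

    return rmse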

In [None]:
study = optuna.create_study(direction='minimize', study_name = 'xgbclassifier') 
study.optimize(objective, n_trials=50)

In [None]:
study.best_params

In [None]:
#fmodel = XGBClassifier(**study.best_params,tree_method='gpu_hist', use_label_encoder=False)
fmodel = XGBClassifier(**study.best_params, use_label_encoder=False)
fmodel.fit(X, y)

Now let's compute our model's score:

In [None]:
# Split the data into training and validation sets

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.3, random_state=2)

In [None]:
from sklearn.model_selection import cross_val_score

cross_val_score(fmodel, X, y, scoring='accuracy')

<a id="4"></a> 
# Conclusion
* If you have any questions, I will answer them in the comments or by email.
* Suggestions and upvotes are always welcome!
* Cheers!

***Thank you for your time,***
***Lucas Silva***