## Breast Cancer Wisconsin (Diagnostic) Data Set
---
### Objective:

Predict breast cancer diagnoses using data collected in the state of Wisconsin, USA.

### Predict whether the cancer is benign or malignant:

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

### Attribute information:

1) ID number

2) Diagnosis (M = malignant, B = benign)

### Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

### Preliminary information:

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
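
The field numbering described above can be reproduced with a short sketch. It assumes the columns are ordered as ID, diagnosis, then the ten means, the ten standard errors, and the ten worst values, and it uses the column names that appear later in this notebook. The helper `field_name` is a hypothetical name introduced only for illustration.

```python
# Base feature names, spelled as in the column names used later in this notebook.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
suffixes = ["mean", "se", "worst"]

# Assumption: fields 1 and 2 are the ID and the diagnosis; the 30 features
# follow, grouped by statistic (all means, then all SEs, then all worst values).
fields = ["id", "diagnosis"] + [
    f"{name}_{suffix}" for suffix in suffixes for name in base_features
]


def field_name(number):
    """Return the column name for a 1-based field number (hypothetical helper)."""
    return fields[number - 1]


print(len(fields) - 2)  # number of feature columns
print(field_name(3), field_name(13), field_name(23))
```

Under this ordering, fields 3, 13 and 23 map to the mean, standard error and worst value of the radius, which matches the description above.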

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant

---

### Data analysis packages:
---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
import matplotlib.gridspec as gridspec

### Inputs:
___

In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv', header = [0])
feature = [feat for feat in list(df) if feat not in ['id','Unnamed: 32']]
df1 = df.filter(feature)

### Exploratory data analysis:
---

#### Target variable:

In [None]:
print("Target variable - Diagnosis")
print(" ")
print(df1.diagnosis.value_counts())
print("\nBenign cases represent {:.4f}% of the dataset.\n".format((df1[df1.diagnosis == 'B'].shape[0] / df1.shape[0]) * 100))
plt.figure(figsize=(10,8))
# Pass the column by keyword; recent seaborn versions reject positional data arguments.
sns.countplot(x='diagnosis',data=df1)
plt.title("Target variable - Diagnosis")
plt.show()

#### Descriptive statistics:

* Statistics of the feature means

In [None]:
df1.filter(['radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean',
            'compactness_mean','concavity_mean','concave points_mean','symmetry_mean','fractal_dimension_mean']).describe()

* Statistics of the feature standard errors

In [None]:
df1.filter(['radius_se','texture_se','perimeter_se','area_se','smoothness_se',
            'compactness_se','concavity_se','concave points_se','symmetry_se','fractal_dimension_se']).describe()

* Statistics of the "worst" (largest) feature values

In [None]:
df1.filter(['radius_worst','texture_worst','perimeter_worst','area_worst','smoothness_worst',
            'compactness_worst','concavity_worst','concave points_worst','symmetry_worst','fractal_dimension_worst']).describe()

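#### Feature histograms by diagnosis:

In [None]:
v_features = df1.columns[1:31]  # the 30 feature columns (column 0 is diagnosis)
plt.figure(figsize=(12, len(v_features) * 8))
gs = gridspec.GridSpec(len(v_features), 1)
for i, cn in enumerate(v_features):
    ax = plt.subplot(gs[i])
    # sns.distplot is deprecated; histplot with kde=True draws a comparable plot.
    sns.histplot(df1.loc[df1.diagnosis == 'B', cn], bins=50, kde=True, stat='density', color='C0', label='B', ax=ax)
    sns.histplot(df1.loc[df1.diagnosis == 'M', cn], bins=50, kde=True, stat='density', color='C1', label='M', ax=ax)
    ax.set_xlabel('')
    ax.set_title('Histogram by diagnosis: ' + str(cn))
    ax.legend()
plt.show()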

#### Correlation map:

In [None]:
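sns.set(rc={'figure.figsize':(10,8)})
# Drop the non-numeric diagnosis column before computing correlations.
sns.heatmap(df1.drop(columns='diagnosis').corr(method='spearman'),fmt = '.2f',cmap='Greens')
plt.title('Correlation between variables')
plt.show()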
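
The Spearman method used for the correlation map is the Pearson correlation computed on ranks, so it captures monotonic relationships even when they are not linear. The sketch below illustrates this in plain Python with made-up toy values (not taken from the dataset). The helper names `average_ranks`, `pearson` and `spearman` are hypothetical and exist only for this illustration.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Toy values: the second list grows with the square of the first.
radius = [1.0, 2.0, 3.0, 4.0, 5.0]
area = [3.14 * r ** 2 for r in radius]

print("Spearman:", spearman(radius, area))
print("Pearson: ", pearson(radius, area))
```

In this toy case the relationship is perfectly monotonic but not linear, so the Spearman coefficient reaches 1 while the Pearson coefficient stays below it.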

### Dimensionality reduction with PCA:
---

In [None]:
# Scaling and PCA packages:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Feature selection:
feature = ['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean',
 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se',
 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst',
 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
x = df1.filter(feature)

# Pre-processing:
x = StandardScaler().fit_transform(x)

# PCA decomposition:
pca = PCA(n_components=4)
principalComponents = pca.fit_transform(x)
var_explicada = pca.explained_variance_ratio_
var_exp_df = pd.DataFrame({"var_exp":var_explicada})
# Format the percentage directly to avoid floating-point noise in the printed value.
print("Variance explained by the four components: {:.0f}%".format(var_exp_df['var_exp'].sum() * 100))

# PCA dataset:
principalDf = pd.DataFrame(data = principalComponents,columns = ['pc1', 'pc2', 'pc3', 'pc4'])
df_pca = pd.concat([principalDf, df1['diagnosis']], axis = 1)
print(" ")
print("Dataset with the principal components: ")
print(" ")
print(df_pca.head(3))
print(" ")

# PCA plot - PC1 and PC2:
plt.figure(figsize=(10,8))
sns.scatterplot(x="pc1", y="pc2", hue="diagnosis", data=df_pca)
plt.title("Principal Components PC1 and PC2")
plt.show()

# PCA plot - PC1 and PC3:
plt.figure(figsize=(10,8))
sns.scatterplot(x="pc1", y="pc3", hue="diagnosis", data=df_pca)
plt.title("Principal Components PC1 and PC3")
plt.show()

# PCA plot - PC1 and PC4:
plt.figure(figsize=(10,8))
sns.scatterplot(x="pc1", y="pc4", hue="diagnosis", data=df_pca)
plt.title("Principal Components PC1 and PC4")
plt.show()
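
To make the explained variance ratio more concrete, here is a minimal sketch that standardizes a matrix and derives the ratio from a singular value decomposition using NumPy only. The data is synthetic (random values generated here, not the breast cancer features), and the variable names are illustrative assumptions. It is one way to reason about how many components to keep, not a replacement for the scikit-learn `PCA` call above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a feature matrix: 200 rows and 6 columns
# driven by 2 latent factors plus a small amount of noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
data = latent @ mixing + 0.1 * rng.normal(size=(200, 6))

# Standardize each column to zero mean and unit variance
# (population standard deviation, the same convention StandardScaler uses).
z = (data - data.mean(axis=0)) / data.std(axis=0)

# PCA via SVD: the squared singular values are proportional to the
# variance captured by each principal component.
_, singular_values, _ = np.linalg.svd(z, full_matrices=False)
explained_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained_ratio)

for k, (ratio, total) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{k}: {ratio:.3f} (cumulative {total:.3f})")
```

Reading the cumulative column shows how much variance is retained when keeping the first k components, which is the quantity the notebook reports for four components.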

### Baseline - Logistic regression:
---

In [None]:
# Packages:
import random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix,auc,roc_curve,classification_report)

# Data splitting:
xtr, xval, ytr, yval = train_test_split(df_pca.drop('diagnosis',axis=1),df_pca['diagnosis'],test_size=0.2,random_state=1025)

# Training model:
baseline = LogisticRegression()
baseline.fit(xtr,ytr)

# Predict:
p = baseline.predict(xval)

# Confusion matrix:
cmx = confusion_matrix(yval, p)
sns.set(rc={'figure.figsize':(10,8)})
sns.set(font_scale=1.4)
sns.heatmap(cmx,annot=True,annot_kws={"size": 14},cmap='Greens')
plt.title("Confusion matrix")
plt.show()

# Metrics:
print("Metrics: ")
print(classification_report(yval, p))
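
The values in the classification report can be traced back to the confusion matrix. The sketch below recomputes precision, recall, F1 and accuracy for one class from four counts. The counts are hypothetical, chosen only for illustration, and `binary_metrics` is a helper name introduced here. With scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, in sorted label order ('B', then 'M').

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute metrics for the positive class from confusion matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, treating 'M' (malignant) as the positive class.
# For a matrix cmx built as above: tn = cmx[0, 0], fp = cmx[0, 1],
# fn = cmx[1, 0], tp = cmx[1, 1].
metrics = binary_metrics(tp=40, fp=2, fn=5, tn=67)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a diagnostic task, recall on the malignant class is often the figure to watch, since a false negative means a missed malignant case.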

### AutoML - H2O:
---

#### Start the H2O cluster:

In [None]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

#### Data splitting:

In [None]:
train, test = train_test_split(df1, test_size=0.2, random_state=1025)  # fixed seed for a reproducible split
traindf = h2o.H2OFrame(train)
testdf = h2o.H2OFrame(test)
y = "diagnosis"
x = list(traindf.columns)
x.remove(y)
traindf[y] = traindf[y].asfactor()
testdf[y] = testdf[y].asfactor()

#### AutoML H2O training:

In [None]:
aml = H2OAutoML(max_models = 80, max_runtime_secs = 300, seed = 247)
aml.train(x = x, y = y, training_frame = traindf)
print(aml.leaderboard)

#### Test model:

In [None]:
predict = aml.predict(testdf)
p = predict.as_data_frame()
print(" ")
data = {'actual': test.diagnosis,'predict': p['predict'].tolist()}
df = pd.DataFrame(data, columns = ['actual','predict'])
df.head(5)

#### Confusion matrix:

In [None]:
# Use a distinct name so sklearn's confusion_matrix function is not shadowed.
cmx_aml = pd.crosstab(df['actual'], df['predict'], rownames=['Actual'], colnames=['Predicted'])
sns.set(rc={'figure.figsize':(10,8)})
sns.set(font_scale=1.4)
sns.heatmap(cmx_aml,annot=True,annot_kws={"size": 16},cmap='Greens')
plt.title("Confusion matrix")
plt.show()
print("Metrics:")
print(classification_report(df['actual'], df['predict']))

#### Shut down the H2O cluster:

In [None]:
h2o.cluster().shutdown(prompt = False)