# Breast cancer Wisconsin

## Useful libraries

In [None]:
# Directive to display plots inline in Jupyter
%matplotlib inline

In [None]:
# Pandas: data manipulation library
# NumPy: scientific computing library
# Matplotlib: plotting and visualization library
# Seaborn: statistical plotting library built on Matplotlib
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score, auc, accuracy_score

## The Breast Cancer Wisconsin dataset

The dataset is available at:  
https://www.kaggle.com/uciml/breast-cancer-wisconsin-data  
http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29  
(`pd.read_table` can be used to read a .dat file)
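
As a minimal sketch of the `pd.read_table` route: it assumes the raw file is comma-separated and has no header row, which should be checked against the file actually downloaded. The two rows and the shortened column list below are synthetic placeholders, not real records.

```python
import io
import pandas as pd

# Synthetic stand-in for a headerless, comma-separated raw file.
raw = io.StringIO(
    "1001,M,14.2,20.1\n"
    "1002,B,11.5,17.3\n"
)

# Hypothetical, shortened column list (the real file has 32 fields).
columns = ["id", "diagnosis", "radius_mean", "texture_mean"]

# read_table defaults to tab-separated input, so the separator is set explicitly.
df_raw = pd.read_table(raw, sep=",", header=None, names=columns)
print(df_raw)
```

With a real file, the `io.StringIO` object would be replaced by the file path.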

In [None]:
df = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

The first 10 rows of the dataset can be displayed:

In [None]:
df.head(10)

The dataset contains the following fields:  
1) ID number  
2) Diagnosis (M = malignant, B = benign)  
3-32) Ten real-valued features computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)  
b) texture (standard deviation of gray-scale values)  
c) perimeter  
d) area  
e) smoothness (local variation in radius lengths)  
f) compactness (perimeter^2 / area - 1.0)  
g) concavity (severity of concave portions of the contour)  
h) concave points (number of concave portions of the contour)  
i) symmetry  
j) fractal dimension ("coastline approximation" - 1)  
  
The mean, standard error and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.  
  
All feature values are recoded with four significant digits.
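
The field numbering described above (10 measurements, each with 3 statistics) can be sketched as follows. The `feature_statistic` naming is an assumption modelled on the Kaggle CSV and should be compared with the output of `df.columns`. The helper `field` is hypothetical.

```python
# The ten base measurements, in the order listed above.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
statistics = ["mean", "se", "worst"]

# Fields 1 and 2 are the ID and the diagnosis; the 30 features follow,
# grouped by statistic (all means, then all standard errors, then all worsts).
fields = ["id", "diagnosis"] + [
    f"{feature}_{stat}" for stat in statistics for feature in base_features
]

def field(n):
    """Return the name of field n, counting from 1 as in the description."""
    return fields[n - 1]

print(len(fields) - 2)                 # number of feature columns
print(field(3), field(13), field(23))  # the three radius statistics
```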

In [None]:
df.columns

To view every column of the table, the output can be rendered as HTML:

In [None]:
from IPython.display import HTML, display  # renders HTML output in Jupyter
display(HTML(df.head(10).to_html()))

In [None]:
df = df.drop(['id', 'Unnamed: 32'], axis=1)

In [None]:
df['diagnosis'] = df['diagnosis'].map({"B":0, "M":1})

Note: Keras works with arrays rather than DataFrames, so we use `.values` to extract the underlying NumPy arrays.

In [None]:
y = df['diagnosis'].values
X = df.drop(['diagnosis'], axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
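
To illustrate what `.values` returns, here is a minimal sketch on a tiny synthetic frame. The numbers are made up and the `_toy` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame standing in for the real dataset.
toy = pd.DataFrame({
    "diagnosis": [0, 1, 0, 1],
    "radius_mean": [11.5, 18.2, 12.1, 20.3],
    "texture_mean": [17.3, 21.0, 15.9, 24.8],
})

y_toy = toy["diagnosis"].values                 # 1-D array of labels
X_toy = toy.drop(["diagnosis"], axis=1).values  # 2-D array of features

print(type(X_toy).__name__, X_toy.shape, y_toy.shape)
```

The pandas documentation recommends `to_numpy()` as the equivalent of `.values` in recent versions.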

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [None]:
model = Sequential()
model.add(Dense(1, activation="sigmoid"))
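
A model made of a single `Dense(1, activation="sigmoid")` layer computes `sigmoid(X @ w + b)`, which is the same functional form as logistic regression. The NumPy sketch below shows that forward pass. The weights and inputs are hypothetical values chosen for illustration, not ones learned by the model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def dense_sigmoid(X, w, b):
    """Forward pass of one dense unit with sigmoid activation."""
    return sigmoid(X @ w + b)

X_demo = np.array([[0.0, 0.0], [1.0, 2.0]])  # two samples, two features
w_demo = np.array([1.0, 1.0])                # hypothetical weights
b_demo = -1.0                                # hypothetical bias

probs = dense_sigmoid(X_demo, w_demo, b_demo)
print(probs)  # one probability per sample
```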

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
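
The `binary_crossentropy` loss penalizes confident wrong predictions heavily. The sketch below is a plain NumPy version of the formula, not Keras's own implementation. The clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

labels = [1, 0]
uninformed = binary_crossentropy(labels, [0.5, 0.5])  # equals ln(2)
confident = binary_crossentropy(labels, [0.9, 0.1])   # good predictions
wrong = binary_crossentropy(labels, [0.1, 0.9])       # bad predictions
print(uninformed, confident, wrong)
```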

In [None]:
train = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=100, verbose=1)

In [None]:
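# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output instead
y_ann = (model.predict(X_test) > 0.5).astype("int32").flatten()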
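
The sigmoid output is a probability, so class labels come from thresholding it. Here is a minimal sketch, assuming a 0.5 threshold. The helper name and the demo probabilities are hypothetical.

```python
import numpy as np

def probabilities_to_classes(probs, threshold=0.5):
    """Turn an (n, 1) array of probabilities into a flat array of 0/1 labels."""
    return (np.asarray(probs) > threshold).astype("int32").flatten()

demo_probs = np.array([[0.1], [0.7], [0.5]])  # shaped like a Keras prediction
demo_classes = probabilities_to_classes(demo_probs)
print(demo_classes)
```

With a strict `>` comparison, a probability of exactly 0.5 is assigned to class 0.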

In [None]:
y_ann

In [None]:
accuracy_score(y_test, y_ann)

In [None]:
confusion_matrix(y_test, y_ann)
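
For binary labels 0 and 1, scikit-learn lays the confusion matrix out as `[[TN, FP], [FN, TP]]`, with true classes in rows and predicted classes in columns. The sketch below derives accuracy, precision and recall from such a matrix. The counts are made up for illustration and are not results from this model.

```python
import numpy as np

def metrics_from_confusion(cm):
    """Compute accuracy, precision and recall from a 2x2 confusion matrix."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),  # share of predicted malignant that are malignant
        "recall": tp / (tp + fn),     # share of malignant cases that were found
    }

demo_cm = [[50, 2], [3, 45]]  # made-up counts
scores = metrics_from_confusion(demo_cm)
print(scores)
```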

In [None]:
def plot_scores(train):
    accuracy = train.history['accuracy']
    val_accuracy = train.history['val_accuracy']
    epochs = range(len(accuracy))
    plt.plot(epochs, accuracy, 'b', label='Training score')
    plt.plot(epochs, val_accuracy, 'r', label='Validation score')
    plt.title('Scores')
    plt.legend()
    plt.show()

In [None]:
plot_scores(train)