# Breast Cancer Wisconsin (Diagnostic) Dataset

------> Attribute Information:

1) ID number

2) Diagnosis (M = malignant, B = benign)

3-32)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)

b) texture (standard deviation of gray-scale values)

c) perimeter

d) area

e) smoothness (local variation in radius lengths)

f) compactness (perimeter^2 / area - 1.0)

g) concavity (severity of concave portions of the contour)

h) concave points (number of concave portions of the contour)

i) symmetry

j) fractal dimension ("coastline approximation" - 1)

------>The mean, standard error and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features. For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.
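The field layout described above can be sketched in a few lines. This is only an illustration: the column names follow the `feature_statistic` pattern (for example `radius_mean`), which is assumed here to match the CSV loaded later, and `field` is a hypothetical helper dictionary.

```python
# Ten base features, each summarized by three statistics -> 30 feature columns.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
statistics = ["mean", "se", "worst"]

# Assumed naming pattern: "<feature>_<statistic>", grouped by statistic.
columns = ["id", "diagnosis"] + [
    f"{feature}_{stat}" for stat in statistics for feature in base_features
]

# Fields are numbered from 1 in the description above.
field = {number: name for number, name in enumerate(columns, start=1)}

print(len(columns) - 2)               # 30 feature columns
print(field[3], field[13], field[23])  # radius_mean radius_se radius_worst
```

Grouping by statistic is what places Mean Radius, Radius SE and Worst Radius exactly ten fields apart.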

------>All feature values are recoded with four significant digits.

------>Missing attribute values: none

------>Class distribution: 357 benign, 212 malignant
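The class distribution gives a useful reference point: a model that always predicts the majority class (benign) is already right on every benign sample. The short calculation below uses only the counts stated above; the variable names are illustrative.

```python
n_benign, n_malignant = 357, 212
n_total = n_benign + n_malignant

# Accuracy of a trivial classifier that always answers "benign".
baseline_accuracy = max(n_benign, n_malignant) / n_total

print(n_total)                      # 569
print(round(baseline_accuracy, 3))  # 0.627
```

Any model evaluated later should clearly beat this figure to be considered useful.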

### Import the Dataset and Visualize the Information

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isnull().sum()

As we can see, only the 'Unnamed: 32' feature has null values, and all of its values are null. So this column can be dropped!
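As a small sketch of the same check done programmatically, the snippet below finds every column that is entirely null instead of reading it off the `isnull().sum()` output. The `toy` frame and its values are made up for illustration and only mimic the shape of the problem.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: one ordinary column, one with a single gap,
# and one that is empty throughout (like 'Unnamed: 32').
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 19.69],
    "texture_mean": [10.38, np.nan, 21.25],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# Columns where every single value is null.
all_null_cols = toy.columns[toy.isnull().all()].tolist()
cleaned = toy.drop(columns=all_null_cols)

print(all_null_cols)             # ['Unnamed: 32']
print(cleaned.columns.tolist())  # ['radius_mean', 'texture_mean']
```

`toy.dropna(axis=1, how="all")` would give the same result in one call; the explicit version makes it easier to see which columns were removed.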

### Engineering Data Analysis

##### 1) 'Unnamed: 32' feature.

In [None]:
df = df.drop('Unnamed: 32', axis=1)

##### 2) 'id' feature.

The id of the patient carries no diagnostic information and could mislead our model, so we drop it.

In [None]:
df = df.drop('id', axis = 1)

##### 3) 'diagnosis' feature.

As we can see, this feature is a string object, so we need to convert it to a binary label. M will be assigned to 1 and B will be assigned to 0.

In [None]:
df['diagnosis'] = df['diagnosis'].replace(['M', 'B'], [1,0])

In [None]:
# Recent seaborn versions require the column to be passed as a keyword argument
sns.countplot(x='diagnosis', data=df)

##### Correlation Between The Features

In [None]:
df.corr()['diagnosis'].sort_values()

In [None]:
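# Drop the label's correlation with itself by name ('diagnosis' is the first column, so [:-1] would drop a feature instead)
df.corr()['diagnosis'].drop('diagnosis').sort_values().plot(kind='bar')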

In [None]:
plt.figure(figsize=(30,30))
sns.heatmap(df.corr(), annot = True, cmap= "coolwarm")

# <font color='green'><b> NEURAL NETWORKS </b> (Multi-Layer Perceptron)</font>

### Train Test Split

Let's now divide our dataset into two parts: X with the features and Y with the label 'diagnosis'.

In [None]:
X = df.drop('diagnosis', axis = 1).values
Y = df['diagnosis'].values

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.25,random_state=101)

### Scaling Data

Let's scale the data to avoid problems when training our Neural Network. Reminder: fit_transform -> X_train ; transform -> X_test, so that the scaler learns its minimum and maximum from the training data only.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [None]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
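To make the fit_transform / transform reminder concrete, here is a minimal sketch on made-up one-column data (the `*_toy` names and values are illustrative, not taken from the dataset). The scaler learns its range from the training values only, so test values outside that range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[5.0], [25.0], [40.0]])

scaler_toy = MinMaxScaler()
train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns min=10, max=30
test_scaled = scaler_toy.transform(X_test_toy)        # reuses min=10, max=30

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [-0.25  0.75  1.5 ]
```

Values below 0 or above 1 in the test set are expected here; they show that no information from the test data was used to fit the scaler.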

### Creating the Model

In [None]:
from tensorflow.keras.models import Sequential
# Only Dense layers are used in this model
from tensorflow.keras.layers import Dense

In [None]:
X_train.shape

In [None]:
model = Sequential()

model.add(Dense(units=30,activation='relu'))
model.add(Dense(units=20,activation='relu'))
model.add(Dense(units=10,activation='relu'))

model.add(Dense(units=1,activation='sigmoid'))

# For a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')

To avoid overfitting, create an early-stopping criterion!

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
cb = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)
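The effect of `patience` can be illustrated with a simplified re-implementation of the stopping rule. This is a sketch of the idea only, not the Keras source: the real callback has further options (such as `min_delta` and `restore_best_weights`), and `early_stopping_epoch` and the loss values below are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 0-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss  # new best value: reset the counter
            wait = 0
        else:
            wait += 1    # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None          # never triggered within the given epochs


# Made-up validation losses: best value at epoch 2, then three worse epochs.
history = [0.90, 0.70, 0.60, 0.65, 0.62, 0.61, 0.50]

print(early_stopping_epoch(history, patience=3))   # 5
print(early_stopping_epoch(history, patience=25))  # None
```

With a small patience the run stops before the late improvement at the last epoch; a larger patience, like the 25 used above, tolerates longer plateaus at the cost of more training time.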

### Training The Model

In [None]:
model.fit(x=X_train_scaled,y=Y_train, validation_data=(X_test_scaled, Y_test), batch_size=450, epochs=600, callbacks=[cb])

After training the model, we analyze the training loss and the validation loss to find out whether the callback needs any changes to improve the model.

In [None]:
losses = pd.DataFrame(model.history.history)

In [None]:
losses.plot()

### Model Evaluation

In [None]:
predictions = (model.predict(X_test_scaled) > 0.5).astype("int32")
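The thresholding step can be tried on its own with made-up probabilities. A network ending in a single sigmoid unit returns an array of shape `(n_samples, 1)`, which the sketch below imitates; `probabilities` and `labels` are illustrative names.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1).
probabilities = np.array([[0.03], [0.50], [0.51], [0.97]])

# Strictly greater than 0.5 counts as class 1 (malignant); ravel() flattens to 1-D.
labels = (probabilities > 0.5).astype("int32").ravel()

print(labels)  # [0 0 1 1]
```

Note that a probability of exactly 0.5 is assigned to class 0. Lowering the threshold would trade precision for recall on the malignant class, which can matter in a diagnostic setting.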

In [None]:
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score

In [None]:
print(confusion_matrix(Y_test,predictions))

In [None]:
print(classification_report(Y_test,predictions))

In [None]:
print(accuracy_score(Y_test,predictions))
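To read the reports above, it helps to see how the numbers are derived from the four confusion-matrix counts. The sketch below does the arithmetic by hand on made-up labels (1 = malignant, 0 = benign); `y_true` and `y_pred` are hypothetical and unrelated to the model's actual output.

```python
# Hypothetical ground truth and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # malignant, predicted malignant
fn = sum(t == 1 and p == 0 for t, p in pairs)  # malignant, missed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # benign, flagged as malignant
tn = sum(t == 0 and p == 0 for t, p in pairs)  # benign, predicted benign

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the predicted malignant, how many are malignant
recall = tp / (tp + fn)     # of the truly malignant, how many were found

# scikit-learn's confusion_matrix uses rows = true, columns = predicted.
print([[tn, fp], [fn, tp]])         # [[5, 1], [1, 3]]
print(accuracy, precision, recall)  # 0.8 0.75 0.75
```

For a cancer diagnosis the false negatives (`fn`) are usually the most costly error, so the recall on the malignant class deserves as much attention as the overall accuracy.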

# <font color='green'><b> KNeighbors Classifier </b></font>

Variables on a large scale have a much larger effect on the distance between observations than variables on a small scale, so we have to standardize the variables!

### Standardize the Features

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler_kkn = StandardScaler()

In [None]:
scaled_features = scaler_kkn.fit_transform(X)

### Train Test Split

In contrast to the Logistic Regression, we need to split the data into training and test sets using the standardized data.

Note: To compare the models fairly, we keep the test_size constant.

In [None]:
X_train_kn, X_test_kn, Y_train_kn, Y_test_kn = train_test_split(scaled_features,Y, test_size=0.25)
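Keeping `test_size` constant fixes the size of the split but not its contents: without a `random_state`, every call draws a different test set. The sketch below, on synthetic labels with the class counts stated at the top, shows one way to make splits repeatable and class-balanced. The `stratify` option and the seed value are suggestions, not part of the original cells.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 357 benign (0) and 212 malignant (1) labels.
y_toy = np.array([0] * 357 + [1] * 212)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # sample index as a dummy feature

# Same seed, same arguments -> identical split on every call.
split_a = train_test_split(X_toy, y_toy, test_size=0.25,
                           random_state=101, stratify=y_toy)
split_b = train_test_split(X_toy, y_toy, test_size=0.25,
                           random_state=101, stratify=y_toy)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# With stratify, the malignant share in the test set tracks the overall share.
test_share_malignant = split_a[3].mean()
overall_share_malignant = y_toy.mean()

print(same_split)
print(round(test_share_malignant, 3), round(overall_share_malignant, 3))
```

Reusing one such split for every model means that differences in the scores reflect the models rather than the luck of the draw.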

### Choosing a K Value

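Note: the code below implements the "Elbow Method". It was designed to choose the number of clusters in KMeans clustering; here it is used as a heuristic for the K value of the classifier.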

Taken from: https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/

In [None]:
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

distortions = []
inertias = []
mapping1 = {}
mapping2 = {}

for k in range(1,30):
    # Build and fit a KMeans model with k clusters (a single fit is enough)
    kmeanModel = KMeans(n_clusters=k).fit(X)

    # Distortion: mean Euclidean distance from each sample to its nearest centroid
    distortion = np.min(cdist(X, kmeanModel.cluster_centers_,
                              'euclidean'), axis=1).sum() / X.shape[0]
    distortions.append(distortion)
    # Inertia: sum of squared distances from each sample to its nearest centroid
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_


In [None]:
plt.plot(range(1,30), distortions, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()

From this plot we can conclude that the best K value is around 8, where we see the elbow. So let's train our model with that number of neighbors.
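Because the elbow plot measures cluster distortion rather than classification quality, it can be worth cross-checking the chosen K against the classifier itself. The sketch below is one possible way to do that, not the method used in this notebook. It scores each candidate `n_neighbors` by 5-fold cross-validated accuracy on scikit-learn's bundled copy of the dataset, which is assumed to be equivalent to the CSV used here except that it encodes malignant as 0 and benign as 1.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the dataset (note: 0 = malignant, 1 = benign here).
X_demo, y_demo = load_breast_cancer(return_X_y=True)

cv_accuracy = {}
for k in range(1, 30):
    # The pipeline re-fits the scaler inside every fold, so the held-out
    # fold never influences the standardization.
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_accuracy[k] = cross_val_score(pipe, X_demo, y_demo, cv=5).mean()

best_k = max(cv_accuracy, key=cv_accuracy.get)
print(best_k, round(cv_accuracy[best_k], 3))
```

If the accuracy curve is flat around its maximum, several K values are about equally good and a value such as 8 remains a reasonable choice.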

### Creating The Model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=8)

### Training and Predicting

In [None]:
knn.fit(X_train_kn,Y_train_kn)

In [None]:
predictions = knn.predict(X_test_kn)

### Evaluating The Model

In [None]:
print(confusion_matrix(Y_test_kn,predictions))

In [None]:
print(classification_report(Y_test_kn,predictions))

In [None]:
print(accuracy_score(Y_test_kn,predictions))

# <font color='green'><b> Decision Trees </b></font>

### Train Test Split

In [None]:
X_train_dt, X_test_dt, Y_train_dt, Y_test_dt = train_test_split(X,Y, test_size=0.25)

### Creating The Model

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
decision_tree_model = DecisionTreeClassifier()

### Training and Predicting

In [None]:
decision_tree_model.fit(X_train_dt,Y_train_dt)

In [None]:
predictions = decision_tree_model.predict(X_test_dt)

### Evaluating The Model

In [None]:
print(confusion_matrix(Y_test_dt,predictions))

In [None]:
print(classification_report(Y_test_dt,predictions))

In [None]:
print(accuracy_score(Y_test_dt,predictions))

# <font color='green'><b> Random Forest </b></font>

In [None]:
from sklearn.ensemble import RandomForestClassifier

### Creating The Model

In [None]:
rf = RandomForestClassifier(n_estimators=100)

### Training and Predicting

In [None]:
rf.fit(X_train_dt, Y_train_dt)

In [None]:
predictions = rf.predict(X_test_dt)

### Evaluating The Model

In [None]:
print(confusion_matrix(Y_test_dt,predictions))

In [None]:
print(classification_report(Y_test_dt,predictions))

In [None]:
print(accuracy_score(Y_test_dt,predictions))

# <font color='green'><b> Support Vector Machine </b></font>

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

### Tuning the best parameters for C and Gamma

We start by creating a dictionary with the most common values:

In [None]:
# Gamma values on a log scale (the repeated 1e-2 is replaced by the missing 1e-1)
parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2],
               'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000]}]

Now we run a loop that trains and evaluates our model, scoring each combination of C and Gamma first by precision and then by recall.

In [None]:
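scores = ['precision', 'recall']
for score in scores:
    # Grid search with 3-fold cross-validation, optimizing the macro-averaged score
    model_svm = GridSearchCV(SVC(), parameters, cv=3, scoring=f'{score}_macro')
    model_svm.fit(X_train, Y_train)

    print(f"Best parameters set found on development set for {score}:")
    print('Gamma:', model_svm.best_estimator_.gamma)
    print('C:', model_svm.best_estimator_.C)

# Note: after the loop, model_svm holds the search tuned for recall (the last score)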
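An RBF kernel is distance-based, so it is sensitive to feature scale in the same way as the neighbors classifier. As a sketch of one alternative setup (not the one used in this notebook), the search below wraps a `StandardScaler` and the `SVC` in a pipeline so that scaling is re-fitted inside each cross-validation fold. It runs on scikit-learn's bundled copy of the dataset (assumed equivalent to the CSV, with malignant encoded as 0), and the reduced grid, the step names `scaler` and `svc`, and the seed are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101, stratify=y_demo
)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "svc__gamma": [1e-3, 1e-2, 1e-1, 1e0],
    "svc__C": [0.1, 1, 10, 100],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="recall_macro")
search.fit(X_tr, y_tr)

print(search.best_params_)
# score() applies the same metric (macro recall) using the refitted best pipeline.
print(round(search.score(X_te, y_te), 3))
```

Since `GridSearchCV` refits the best pipeline on the whole training set by default, `search` can be used directly for prediction without building a second `SVC` by hand.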

### Creating The Model

In [None]:
model_svm_best = SVC(max_iter = 1000000, kernel = 'rbf', gamma =model_svm.best_estimator_.gamma, C =model_svm.best_estimator_.C)


In [None]:
model_svm_best.fit(X_train, Y_train)

### Train and Predict

In [None]:
predictions = model_svm_best.predict(X_test)

In [None]:
print(confusion_matrix(Y_test,predictions))

In [None]:
print(classification_report(Y_test,predictions))

In [None]:
print(accuracy_score(Y_test,predictions))