**<center><font size=6>Breast Cancer Wisconsin (Diagnostic)</font>
</center>**

**Data Set origin**: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

**Date**: 08.01.2021

**Table of Contents**
- <a href='#read'>1. Reading the data</a> 
- <a href='#understand'>2. Understanding and preparing the data</a>
    - <a href='#describe'>2.1. Describing and planning the data</a>
    - <a href='#group'>2.2. Preprocessing</a>
- <a href='#split'>3. Splitting the data</a>
- <a href='#fit'>4. Fitting and validating the models</a>
    - <a href='#LgR'>4.1. Logistic Regression (LgR) and GridSearchCV</a>
    - <a href='#PCALgR'>4.2. Principal Component Analysis (PCA) and Logistic Regression (LgR) and GridSearchCV</a>
    - <a href='#LinSVC'>4.3. Linear SupportVectorClassifier (LinSVC) and GridSearchCV</a>
    - <a href='#RBFSVC'>4.4. RBF SupportVectorClassifier (RBF SVC) and GridSearchCV</a>
    - <a href='#DT'>4.5. DecisionTreeClassifier (DT) and GridSearchCV</a>
    - <a href='#RF'>4.6. RandomForestClassifier (RF) and GridSearchCV</a>
    - <a href='#KNN'>4.7. K-Nearest-Neighbours (KNN) and GridSearchCV</a>    
    - <a href='#OLS'>4.8. Ordinary Least Squares Linear Regression (OLS)</a>
- <a href='#summary'>5. Summary</a>

# <a id='read'>1. Reading the data</a>

In [None]:
# Matplotlib config
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
%config InlineBackend.rc = {'figure.figsize': (5.0, 4.0)}

import pandas as pd
import numpy as np
import csv
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, precision_score, plot_confusion_matrix
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split

from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
import statsmodels.api as sm

input_file = "../input/breast-cancer-wisconsin-data/data.csv"
df = pd.read_csv(input_file, header = 0, sep = ',', quotechar='"')
df.head()

# <a id='understand'>2. Understanding and preparing the data</a>

In [None]:
df.columns

In [None]:
df.info()
df.describe()

## <a id='describe'>2.1. Describing and planning the data</a>

| Original variable | Non-missing values | Type | Ranges/Values | Description | Preparation/Transformation | New variable | New ranges/values |
| :- | --- | :- | :- | :- | :- | :- | :- |
| id | 569 | int64 | | | delete | | |
| diagnosis | 569 | object | B/M | B=benign, M=malignant, target variable | encode categories as booleans or numbers | diagnosis_ord | False=0=B, True=1=M |
| optional | | | | | normalising | | 0-1 |
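The plan in the table can be sketched on a toy frame. This is a minimal illustration, not the notebook's data: the three rows and the `radius_mean` values are invented, and min-max scaling is only one assumed reading of the optional "normalising" step to the range 0-1.

```python
import pandas as pd

# Toy stand-in for data.csv (all values are invented for illustration)
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [20.0, 10.0, 15.0],
})

# Drop the identifier, which carries no diagnostic information
toy = toy.drop(columns = "id")

# Encode the target: True/1 = malignant, False/0 = benign
toy["diagnosis"] = toy["diagnosis"] == "M"
toy["diagnosis_ord"] = toy["diagnosis"].astype(int)

# Optional min-max normalisation of a feature column to the range 0-1
col = toy["radius_mean"]
toy["radius_mean_norm"] = (col - col.min()) / (col.max() - col.min())

print(toy)
```

If normalisation is used for modelling, the minimum and maximum would normally be taken from the training set only and then reused for the test set.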

## <a id='group'>2.2. Preprocessing</a>

In [None]:
#Delete the "id" column
df.drop('id', axis = 1, inplace = True)

In [None]:
df['diagnosis'].unique()

In [None]:
#Convert the "diagnosis" column to True/False (True = malignant)
df['diagnosis'] = df['diagnosis'] == "M"
df['diagnosis_ord'] = pd.Categorical(df.diagnosis).codes

In [None]:
#Share of malignant cases
print('malignant: ' + str(np.mean(df['diagnosis'])))

# <a id='split'>3. Splitting the data</a>

Splitting the data into a training set and a test set with `train_test_split`.

| Split set | Share of the data | Used for |
| :- | --- | :- |
| train set | 80% | training and validating the models |
| test set | 20% | testing the models |
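The 80/20 split can be checked on toy data. This is a sketch under stated assumptions: the arrays `X_toy` and `y_toy` are invented, and `stratify = y_toy` is an addition that is not part of the notebook's own split; it keeps the share of positive cases equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 40 of them positive (values are invented)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([True] * 40 + [False] * 60)

# 80% train / 20% test; stratify = y_toy is an added assumption
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size = 0.80, random_state = 42, stratify = y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the class shares in the two parts can differ by chance, which matters most for small test sets.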

In [None]:
#Absolute correlation of every column with "diagnosis", strongest first
df.corr()["diagnosis"].abs().sort_values(ascending = False)

In [None]:
#Correlation Matrix Graphic
sns.heatmap(df.corr());

In [None]:
#Correlation Matrix Table
df.corr()
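The correlation ranking can also be turned into a feature list programmatically instead of typing the names by hand. The following is a minimal sketch on an invented frame; the column names `a`, `b`, `c` and the cut-off of two features are assumptions for illustration only.

```python
import pandas as pd

# Toy frame; the column names "a", "b", "c" and all values are invented
toy = pd.DataFrame({
    "diagnosis": [0, 0, 1, 1, 0, 1],
    "a": [1.0, 1.0, 3.0, 3.0, 1.0, 3.0],   # perfectly correlated with the target
    "b": [9.0, 8.0, 2.0, 3.0, 7.0, 1.0],   # strongly negatively correlated
    "c": [5.0, 4.0, 5.0, 4.0, 4.5, 4.5],   # uncorrelated
})

# Rank features by absolute correlation with the target, excluding the target itself
ranking = (
    toy.corr()["diagnosis"]
       .drop("diagnosis")
       .abs()
       .sort_values(ascending = False)
)

# Keep the two strongest features
top_features = ranking.head(2).index.tolist()

print(ranking)
print(top_features)
```

Taking the absolute value keeps strongly negative correlations such as `b` in the selection. A ranking of this kind ignores correlation among the features themselves, so highly redundant columns can still end up together.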

In [None]:
#Attribute choice (each column listed once)
features = ['concave points_worst', 'perimeter_worst', 'radius_worst',
            'area_worst', 'concavity_mean', 'compactness_mean',
            'texture_worst', 'smoothness_worst', 'symmetry_worst',
            'fractal_dimension_worst']

X = df[features]

y = df['diagnosis']

#y = df['diagnosis_ord']

#Train/test split (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.80, random_state = 42)

# <a id='fit'>4. Fitting and validating the models</a>

## <a id='LgR'>4.1. Logistic Regression (LgR) and GridSearchCV</a>

In [None]:
#Logistic Regression (LgR)
#Parameter tuning with GridSearchCV and RepeatedKFold
model_lgrgscv = GridSearchCV(LogisticRegression(), param_grid = { 
    "max_iter": [1000000],
    "class_weight": ["balanced"]
}, cv = RepeatedKFold())

model_lgrgscv.fit(X_train, y_train)

print('LgR GridSearchCV best model parameter: ' + str(model_lgrgscv.best_params_))
print('LgR GridSearchCV best model score: ' + str(model_lgrgscv.best_score_))

In [None]:
#Train a Logistic Regression
model_lgr = LogisticRegression(class_weight = "balanced", max_iter = 1000000)
model_lgr.fit(X_train, y_train)

In [None]:
#Predict
y_lgr_pred_test = model_lgr.predict(X_test)

In [None]:
#Confusion matrix
plot_confusion_matrix(model_lgr, X_test, y_test, normalize = "all")

In [None]:
#Scores
print('LgR train score: ' + str(model_lgr.score(X_train, y_train)))
print('LgR test score: ' + str(model_lgr.score(X_test, y_test)))
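The `score` method of a classifier reports accuracy only. For a diagnostic task it can help to look at recall and precision as well, using the `recall_score` and `precision_score` functions imported at the top. The labels below are invented for illustration and are not results of any model in this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Invented labels for illustration: 1 = malignant, 0 = benign
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print('accuracy: ' + str(accuracy_score(y_true, y_pred)))    # (tp + tn) / all
print('recall: ' + str(recall_score(y_true, y_pred)))        # tp / (tp + fn)
print('precision: ' + str(precision_score(y_true, y_pred)))  # tp / (tp + fp)
```

Here one malignant case is missed (`fn = 1`), which lowers recall to 0.75 even though accuracy is 0.8.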

## <a id='PCALgR'>4.2. Principal Component Analysis (PCA) and Logistic Regression (LgR) and GridSearchCV</a>

In [None]:
#Dimension reduction with Principal Component Analysis (PCA) and then Logistic Regression (LgR)
#Parameter tuning with GridSearchCV and RepeatedKFold
pca = PCA(n_components = 4)

# pca.fit(X_train)
# X_train_transformed = pca.transform(X_train)

X_train_transformed = pca.fit_transform(X_train)

model_pcalgrgscv = GridSearchCV(LogisticRegression(), param_grid = { 
    "max_iter": [1000000]
}, cv = RepeatedKFold())

model_pcalgrgscv.fit(X_train_transformed, y_train)

print('PCA LgR GridSearchCV best model parameter: ' + str(model_pcalgrgscv.best_params_))
print('PCA LgR GridSearchCV best model score: ' + str(model_pcalgrgscv.best_score_))

In [None]:
#Train
model_pcalgr = LogisticRegression(class_weight = "balanced", max_iter = 1000000)
model_pcalgr.fit(X_train_transformed, y_train)

In [None]:
sns.scatterplot(x = X_train_transformed[:, 0], y = X_train_transformed[:, 1], hue = y_train);

In [None]:
#Predict
X_test_transformed = pca.transform(X_test)

y_pcalgr_test_pred = model_pcalgr.predict(X_test_transformed)

In [None]:
sns.scatterplot(x = X_test_transformed[:, 0], y = X_test_transformed[:, 1], hue = y_test);

In [None]:
#Confusion matrix
plot_confusion_matrix(model_pcalgr, X_test_transformed, y_test, normalize = "all")

In [None]:
#Scores
print('PCA LgR train score: ' + str(model_pcalgr.score(X_train_transformed, y_train)))
print('PCA LgR test score: ' + str(model_pcalgr.score(X_test_transformed, y_test)))
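PCA is sensitive to the scale of the input columns, and this section applies it to unscaled features. Whether standardising first changes the result here is an open question; the sketch below only shows how to compare the explained variance in both cases. It uses scikit-learn's bundled copy of the data set with all 30 features, which is an assumption and differs from the feature selection above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo, _ = load_breast_cancer(return_X_y = True)

# PCA on raw features: columns with large numeric ranges (e.g. areas) dominate
pca_raw = PCA(n_components = 4).fit(X_demo)

# PCA after standardising every feature to zero mean and unit variance
pca_std = PCA(n_components = 4).fit(StandardScaler().fit_transform(X_demo))

print('raw:    ' + str(pca_raw.explained_variance_ratio_.round(3)))
print('scaled: ' + str(pca_std.explained_variance_ratio_.round(3)))
```

On the raw data almost all variance lands in the first component, because it mostly tracks the area columns. After scaling, the variance is spread over several components.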

## <a id='LinSVC'>4.3. Linear SupportVectorClassifier (LinSVC) and GridSearchCV</a>

In [None]:
#Linear SupportVectorClassifier (LinSVC)
#Parameter tuning with GridSearchCV and RepeatedKFold
sc = StandardScaler()
sc_train = sc.fit(X_train)

X_train_scalar = sc.transform(X_train)
X_test_scalar = sc.transform(X_test)

model_linsvcgscv = GridSearchCV(LinearSVC(), param_grid = {
    "C": [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1], 
    "max_iter": [1000000]
}, cv = RepeatedKFold())

model_linsvcgscv.fit(X_train_scalar, y_train)

print('LinSVC GridSearchCV best model parameter: ' + str(model_linsvcgscv.best_params_))
print('LinSVC GridSearchCV best model score: ' + str(model_linsvcgscv.best_score_))
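In the cell above the scaler is fitted on the full training set before cross-validation, so each validation fold has already influenced the scaling. One alternative, sketched here as an assumption and not as the notebook's method, is to wrap scaler and classifier in a `Pipeline` and pass that to `GridSearchCV`. The step names `scale` and `svc`, the shortened `C` grid and the bundled data set are all chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X_demo, y_demo = load_breast_cancer(return_X_y = True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size = 0.80, random_state = 42)

# Scaler and classifier are refitted together on the training part of each fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", LinearSVC(max_iter = 100000)),
])

# Step parameters are addressed as <step name>__<parameter>
search = GridSearchCV(pipe, param_grid = {
    "svc__C": [0.001, 0.01, 0.1, 1],
}, cv = RepeatedKFold(n_splits = 5, n_repeats = 2, random_state = 0))

search.fit(X_tr, y_tr)

print('best parameter: ' + str(search.best_params_))
print('best CV score: ' + str(search.best_score_))
print('test score: ' + str(search.score(X_te, y_te)))
```

Because `GridSearchCV` refits the best pipeline on the whole training set by default, `search` can be used directly for prediction and no separate "Train" step with hand-copied parameters is needed.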

In [None]:
#Train
model_linsvc = LinearSVC(max_iter = 1000000, C = 0.05)
model_linsvc.fit(X_train_scalar, y_train)

In [None]:
#Predict
y_linsvc_test_pred = model_linsvc.predict(X_test_scalar)

In [None]:
#Confusion matrix 
plot_confusion_matrix(model_linsvc, X_test_scalar, y_test, normalize = "all")

In [None]:
#Scores
print('LinSVC train score: ' + str(model_linsvc.score(X_train_scalar, y_train)))
print('LinSVC test score: ' + str(model_linsvc.score(X_test_scalar, y_test)))

## <a id='RBFSVC'>4.4. RBF SupportVectorClassifier (RBF SVC) and GridSearchCV</a>

In [None]:
#RBF SupportVectorClassifier (RBF SVC)
#Parameter tuning with GridSearchCV and RepeatedKFold
model_rbfsvcgscv = GridSearchCV(SVC(), param_grid = {
    "kernel": ["rbf"], 
    "C": [15, 20, 25, 30, 35, 40, 45], 
    "gamma": [0.0005, 0.001, 0.005, 0.01, 0.05]
}, cv = RepeatedKFold())

model_rbfsvcgscv.fit(X_train_scalar, y_train)

print('RBF SVC GridSearchCV best model parameter: ' + str(model_rbfsvcgscv.best_params_))
print('RBF SVC GridSearchCV best model score: ' + str(model_rbfsvcgscv.best_score_))

In [None]:
#Train
model_rbfsvc = SVC(kernel = "rbf", C = 25, gamma = 0.005)
model_rbfsvc.fit(X_train_scalar, y_train)

In [None]:
#Predict
y_rbfsvc_test_pred = model_rbfsvc.predict(X_test_scalar)

In [None]:
#Confusion matrix
plot_confusion_matrix(model_rbfsvc, X_test_scalar, y_test, normalize = "all")

In [None]:
#Scores
print('RBF SVC train score: ' + str(model_rbfsvc.score(X_train_scalar, y_train)))
print('RBF SVC test score: ' + str(model_rbfsvc.score(X_test_scalar, y_test)))

## <a id='DT'>4.5. DecisionTreeClassifier (DT) and GridSearchCV</a>

In [None]:
#DecisionTreeClassifier (DT)
#Parameter tuning with GridSearchCV and RepeatedKFold
X_train_dumm = pd.get_dummies(X_train[features])
X_test_dumm = pd.get_dummies(X_test[features])

model_dtgscv = GridSearchCV(DecisionTreeClassifier(), param_grid = {
    'max_depth': [23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
    'min_samples_leaf': [2, 3, 4, 5, 6, 7]
}, cv = RepeatedKFold())

model_dtgscv.fit(X_train_dumm, y_train)

print('DT GridSearchCV best model parameters: ' + str(model_dtgscv.best_params_))
print('DT GridSearchCV best model score: ' + str(model_dtgscv.best_score_))

In [None]:
#Train
model_dt = DecisionTreeClassifier(max_depth=28, min_samples_leaf = 5)
model_dt.fit(X_train_dumm, y_train)

In [None]:
#Predict
y_dt_test_pred = model_dt.predict(X_test_dumm)

In [None]:
#Confusion matrix
plot_confusion_matrix(model_dt, X_test_dumm, y_test, normalize = "all")

In [None]:
#Scores
print('DT train score: ' + str(model_dt.score(X_train_dumm, y_train)))
print('DT test score: ' + str(model_dt.score(X_test_dumm, y_test)))

## <a id='RF'>4.6. RandomForestClassifier (RF) and GridSearchCV</a>

In [None]:
#RandomForestClassifier (RF)
#Parameter tuning with GridSearchCV and RepeatedKFold
model_rfgscv = GridSearchCV(RandomForestClassifier(), param_grid = {
    'max_depth': [11, 12, 13, 14, 15],
    'min_samples_leaf': [1, 5, 10]
}, cv = RepeatedKFold())

model_rfgscv.fit(X_train_dumm, y_train)

print('RF GridSearchCV best model parameter: ' + str(model_rfgscv.best_params_))
print('RF GridSearchCV best model score: ' + str(model_rfgscv.best_score_))

In [None]:
#Train
model_rf = RandomForestClassifier(n_estimators=100, max_depth=15, min_samples_leaf = 1)
model_rf.fit(X_train_dumm, y_train)

In [None]:
#Predict
y_rf_test_pred = model_rf.predict(X_test_dumm)

In [None]:
#Confusion matrix
plot_confusion_matrix(model_rf, X_test_dumm, y_test, normalize = "all")

In [None]:
#Scores
print('RF train score: ' + str(model_rf.score(X_train_dumm, y_train)))
print('RF test score: ' + str(model_rf.score(X_test_dumm, y_test)))

## <a id='KNN'>4.7. K-Nearest-Neighbours (KNN) and GridSearchCV</a>

In [None]:
#K-Nearest-Neighbours (KNN)
#Parameter tuning with GridSearchCV and RepeatedKFold
model_knngscv = GridSearchCV(KNeighborsClassifier(), param_grid = {
    'n_neighbors': [5, 6, 7, 8, 9, 10, 15, 20, 25, 35, 50, 75],
    'p': [1, 2], 
    'weights': ['uniform', 'distance']
}, cv = RepeatedKFold())

model_knngscv.fit(X_train_scalar, y_train)

print('KNN GridSearchCV best model parameter: ' + str(model_knngscv.best_params_))
print('KNN GridSearchCV best model score: ' + str(model_knngscv.best_score_))

In [None]:
#Train
model_knn = KNeighborsClassifier(n_neighbors = 9, p = 1, weights = 'uniform')
model_knn.fit(X_train_scalar, y_train)

In [None]:
#Predict
y_knn_test_pred = model_knn.predict(X_test_scalar)

In [None]:
#Confusion Matrix
plot_confusion_matrix(model_knn, X_test_scalar, y_test, normalize = "all")

In [None]:
#Scores
print('KNN train score: ' + str(model_knn.score(X_train_scalar, y_train)))
print('KNN test score: ' + str(model_knn.score(X_test_scalar, y_test)))

## <a id='OLS'>4.8. Ordinary Least Squares Linear Regression (OLS)</a>

In [None]:
#Ordinary Least Squares (OLS)
#Train
X_train_ols = sm.add_constant(X_train)
model_ols = sm.OLS(y_train, X_train_ols).fit()

#summary
model_ols.summary()

In [None]:
#Predict
X_test_ols = sm.add_constant(X_test)

y_ols_train_pred = model_ols.predict(X_train_ols)
y_ols_test_pred = model_ols.predict(X_test_ols)

#Map the continuous OLS output to classes: True (malignant) from 0.5 upwards
y_ols_train_pred = y_ols_train_pred >= 0.5
y_ols_test_pred = y_ols_test_pred >= 0.5
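Because OLS returns continuous values and not class labels, its predictions have to be mapped to the two classes. The sketch below illustrates why a 0.5 threshold is a safer mapping than rounding. The data are invented, and `np.linalg.lstsq` stands in for the statsmodels fit so that the example needs NumPy only.

```python
import numpy as np

# Invented one-feature data; labels are 0 = benign, 1 = malignant
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Least-squares fit of y = intercept + slope * x
# (np.linalg.lstsq stands in for statsmodels' OLS here)
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond = None)
intercept, slope = coef

# Continuous predictions, including a new point far from the training range
x_new = np.array([0.0, 1.0, 2.0, 3.0, 5.0])
scores = intercept + slope * x_new

print(scores)             # the last value lies above 1
print(np.round(scores))   # rounding can produce values other than 0 and 1
print(scores >= 0.5)      # thresholding always yields a boolean class
```

For the point at `x = 5` the fitted line gives 1.9, which rounds to 2 and is not a valid class, while the threshold still assigns it to the malignant class.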

In [None]:
#Confusion matrix

#Train set
ols_train_cm = confusion_matrix(y_train, y_ols_train_pred, normalize = "all")
ols_train_score = accuracy_score(y_train, y_ols_train_pred)

print('Train OLS Confusion Matrix: \n' + str(ols_train_cm) + '\n')

#Test set
ols_test_cm = confusion_matrix(y_test, y_ols_test_pred, normalize = "all")
ols_test_score = accuracy_score(y_test, y_ols_test_pred)

print('Test OLS Confusion Matrix: \n' + str(ols_test_cm))

In [None]:
#Scores
print('OLS train score: ' + str(ols_train_score))
print('OLS test score: ' + str(ols_test_score))

# <a id='summary'>5. Summary</a>

In [None]:
#Summary, Scores (test accuracy of every model, best first)
relevant_metrics_pred = pd.DataFrame({
    'Model': ['Logistic Regression', 'PCA Logistic Regression', 'LinSVC', 'RBFSVC',
              'Decision Tree', 'Random Forest', 'K-Nearest-Neighbours', 'OLS'],
    'Accuracy, A': [model_lgr.score(X_test, y_test),
                    model_pcalgr.score(X_test_transformed, y_test),
                    model_linsvc.score(X_test_scalar, y_test),
                    model_rbfsvc.score(X_test_scalar, y_test),
                    model_dt.score(X_test_dumm, y_test),
                    model_rf.score(X_test_dumm, y_test),
                    model_knn.score(X_test_scalar, y_test),
                    ols_test_score]})
best_model_pred = relevant_metrics_pred.sort_values(by = 'Accuracy, A', ascending = False)
best_model_pred