# Introduction
This notebook compares several machine learning (ML) models on the Breast Cancer Wisconsin dataset to find the one that performs best.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score,confusion_matrix
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from pandas import DataFrame
from sklearn.svm import SVC
# Set seed for reproducibility
SEED = 123

In [None]:
dataset = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')

# Data Analysis

We start by analyzing the breast cancer data and checking how many features and rows it contains.

In [None]:
dataset.info()

In [None]:
dataset.head()

From the output above, we can observe that:
1. There are 33 columns and 569 rows.
2. The column **Unnamed: 32** contains only null values, so we can drop it.
3. Every value in the **id** column is unique, so it carries no predictive information.
4. The **diagnosis** column is our class label.
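
The observations above can also be checked programmatically rather than by reading the `info()` output. The following is a minimal sketch on a hypothetical four-row frame (`toy` and its values are made up for illustration and are not the real data); the same two checks could be run on `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics the structure described above
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, 12.45, 20.57],
    "Unnamed: 32": [np.nan, np.nan, np.nan, np.nan],
})

# Columns in which every single value is missing
all_null_cols = [col for col in toy.columns if toy[col].isna().all()]

# True when no identifier is repeated
id_is_unique = toy["id"].is_unique

print(all_null_cols)
print(id_is_unique)
```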

In [None]:
dataset.drop(['Unnamed: 32', 'id'], axis=1, inplace=True)

The class label takes the values M and B: (M) Malignant means the person was diagnosed with cancer, and (B) Benign means the person was not. We will map these values to 1 and 0 respectively.
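
One thing to keep in mind with this kind of mapping is that `Series.map` silently produces `NaN` for any value missing from the dictionary. A small sketch on made-up labels (the `labels` and `bad_labels` series are hypothetical) shows the behaviour and a simple guard:

```python
import pandas as pd

mapping = {"M": 1, "B": 0}

# Hypothetical labels, all of which are covered by the mapping
labels = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")
encoded = labels.map(mapping)

# Guard: fail loudly if any label was not covered by the mapping
assert encoded.notna().all(), "found labels other than 'M' or 'B'"

# A label outside the mapping ("X") becomes NaN instead of raising an error
bad_labels = pd.Series(["M", "X"])
bad_encoded = bad_labels.map(mapping)

print(encoded.tolist())
print(bad_encoded.isna().tolist())
```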

In [None]:
dataset['diagnosis'] = dataset['diagnosis'].map({'M':1, 'B':0})

In [None]:
dataset.describe()

From the summary above, we can say that:
1. The features differ widely in mean, standard deviation, minimum and maximum. Some machine learning models perform poorly when features are not standardized or normalized, so we will examine the impact later.
2. We do not fully understand what each feature means, and we might not need to.
3. We don't know whether all the features will be useful in determining the class label. Some features may be highly correlated with others and therefore redundant. We will find that out.

# Data Visualization

Let's look at the distribution of malignant and benign cases.

In [None]:
dataset['diagnosis'].value_counts()

In [None]:
sns.countplot(x='diagnosis', data=dataset)
plt.title('Breast Cancer Diagnosis')
plt.show()

From the plot below, we can see that as the **area_mean** and **radius_mean** values increase, there is a higher chance of the patient being diagnosed with cancer.

In [None]:
sns.scatterplot(x = 'area_mean', y = 'radius_mean', hue = 'diagnosis', data = dataset)
plt.show()

Similarly, from the plot below, we can see that as the **radius_worst** and **radius_mean** values increase, there is a higher chance of the patient being diagnosed with cancer.

In [None]:
sns.scatterplot(x = 'radius_worst', y = 'radius_mean', hue = 'diagnosis', data = dataset)
plt.show()

In [None]:
plt.figure(figsize=(20,10)) 
sns.heatmap(dataset.corr(), annot=True)
plt.show()

From the correlation plot above, we can see that many features are highly correlated with each other and therefore add little new information to the model.
For example, radius_mean is highly correlated with perimeter_mean, area_mean, radius_worst, perimeter_worst and area_worst, so we can drop those five and keep only radius_mean.
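
Reading correlated pairs off a large annotated heatmap is error-prone, so it can help to list them in code. The sketch below is one possible way to do that; the helper `correlated_pairs`, the 0.9 threshold, and the synthetic `toy` frame are all assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return (feature_a, feature_b, corr) for every pair with |corr| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    # Only visit the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], round(float(value), 3)))
    return pairs

# Synthetic stand-in: perimeter is almost a linear function of radius,
# while texture is generated independently
rng = np.random.default_rng(0)
radius = rng.uniform(5, 25, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0, 0.5, size=200),
    "texture": rng.uniform(10, 40, size=200),
})

pairs = correlated_pairs(toy)
print(pairs)
```

On the real data, something like `correlated_pairs(dataset.drop(columns=['diagnosis']))` would list the candidate features to drop; the threshold is a judgment call.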

We create a new_dataset dataframe so that we can later compare model performance on it with performance on the original feature set.

In [None]:
new_dataset = dataset.drop(['perimeter_mean', 'area_mean', 
                            'radius_worst', 'perimeter_worst', 'area_worst',
                           'perimeter_se', 'area_se', 'texture_worst',
                           'concave points_worst', 'concavity_mean', 'compactness_worst'], axis=1)

In [None]:
plt.figure(figsize=(20,10)) 
sns.heatmap(new_dataset.corr(), annot=True)
plt.show()

From the new_dataset heatmap, we can observe that we have removed most of the highly correlated features. 

In [None]:
X = new_dataset.drop(['diagnosis'], axis=1)
y = new_dataset['diagnosis']

In [None]:
X.head()

### Creating a test set and a training set

Since this data set is not ordered, we will do a simple 70:30 split to create a training set and a test set.
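
A possible variation on the simple split, shown here only as a sketch: when the classes are imbalanced, passing `stratify` to `train_test_split` keeps the class ratio nearly the same in both partitions. The toy arrays below (`X_toy`, `y_toy`, with a made-up 63:37 class ratio) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 63 negatives and 37 positives
y_toy = np.array([0] * 63 + [1] * 37)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the class proportions in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

# Share of positive labels in each partition
print(y_tr.mean(), y_te.mean())
```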

In [None]:
# Split dataset into 70% train, 30% test
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.3, random_state=SEED)

# Feature Scaling

Most of the time, a dataset contains features that vary widely in magnitude, unit and range.
Many machine learning algorithms, such as KNN and SVM with an RBF kernel, rely on the Euclidean distance between data points, so features with large values can dominate the computation.
To prevent this, we need to bring all features to a comparable magnitude, which can be achieved by scaling.
Scaling transforms the data so that it fits within a specific range, like 0–100 or 0–1.
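
To make the two scaling methods used in this notebook concrete, here is a small sketch that computes them by hand and cross-checks the result against scikit-learn. The matrix `X_toy` is a made-up example; the formulas are min-max normalization, (x - min) / (max - min), and standardization, (x - mean) / std.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy matrix: column 0 is small-scale, column 1 is large-scale
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [4.0, 500.0]])

# Min-max normalization per column: (x - min) / (max - min)
col_min, col_max = X_toy.min(axis=0), X_toy.max(axis=0)
manual_norm = (X_toy - col_min) / (col_max - col_min)

# Standardization per column: (x - mean) / std, using the population std (ddof=0)
manual_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Cross-check against the scikit-learn scalers
assert np.allclose(manual_norm, MinMaxScaler().fit_transform(X_toy))
assert np.allclose(manual_std, StandardScaler().fit_transform(X_toy))

# Euclidean distance between the first two rows before and after normalization
raw_dist = np.linalg.norm(X_toy[0] - X_toy[1])
norm_dist = np.linalg.norm(manual_norm[0] - manual_norm[1])
print(raw_dist, norm_dist)
```

Before scaling, the distance is driven almost entirely by the large-scale column; after normalization both columns contribute on a comparable footing.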

### Normalize the data

In [None]:
# fit scaler on training data
norm = MinMaxScaler().fit(X_train)

# transform training data
X_train_norm = norm.transform(X_train)

# transform testing data
X_test_norm = norm.transform(X_test)

In [None]:
DataFrame(X_train_norm).describe()

From the above, we can see that after normalizing the data, every column has a minimum of 0 and a maximum of 1.
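
Note that this guarantee applies only to the data the scaler was fitted on. Because the scaler learns its minimum and maximum from the training set, transformed test values can fall outside the 0 to 1 range. A minimal sketch with made-up numbers (`train` and `test` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: the test set extends beyond the training range
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

# The scaler learns min=10 and max=30 from the training data only
scaler = MinMaxScaler().fit(train)
test_scaled = scaler.transform(test)

# 5 maps below 0 and 40 maps above 1: the values are -0.25, 0.75 and 1.5
print(test_scaled.ravel())
```

This is expected behaviour rather than a bug; fitting the scaler on the test set to "fix" it would leak test information into preprocessing.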

### Standardize the data

In [None]:
# fit scaler on training data
stdscale = StandardScaler().fit(X_train)

# transform training data
X_train_std = stdscale.transform(X_train)

# transform testing data
X_test_std = stdscale.transform(X_test)

In [None]:
DataFrame(X_train_std).describe()

From the above, we can see that after standardizing the data, every column has a mean of approximately 0 and a standard deviation of approximately 1.

# Model Selection

In [None]:
# Instantiate individual classifiers
lr = LogisticRegression(max_iter = 500, n_jobs=-1, random_state=SEED)
knn = KNN()
dt = DecisionTreeClassifier(random_state=SEED)
svc = SVC(kernel='rbf', probability = True, random_state=SEED)
rf = RandomForestClassifier(random_state=SEED)

# Define a list called classifiers that contains the tuples (classifier_name, classifier)
classifiers = [('Logistic Regression', lr),
('K Nearest Neighbours', knn),
('SVM', svc),
('Random Forest Classifier', rf),
('Decision Tree', dt)]              

### Model predictions without any normalization or standardization

In [None]:
# Iterate over the defined list of tuples containing the classifiers
for clf_name, clf in classifiers:
    #fit clf to the training set
    clf.fit(X_train, y_train)
    # Predict the labels of the test set
    y_pred = clf.predict(X_test)
    # Evaluate the accuracy of clf on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))

### Model predictions with normalized data

In [None]:
# Iterate over the defined list of tuples containing the classifiers
for clf_name, clf in classifiers:
    #fit clf to the training set
    clf.fit(X_train_norm, y_train)
    # Predict the labels of the test set
    y_pred = clf.predict(X_test_norm)
    # Evaluate the accuracy of clf on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))

### Model predictions with standardized data

In [None]:
# Iterate over the defined list of tuples containing the classifiers
for clf_name, clf in classifiers:
    #fit clf to the training set
    clf.fit(X_train_std, y_train)
    # Predict the labels of the test set
    y_pred = clf.predict(X_test_std)
    # Evaluate the accuracy of clf on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))

From the above accuracy scores, we can observe the following:
1. DecisionTree and RandomForestClassifier are insensitive to feature scaling.
2. LogisticRegression, KNN and SVM are sensitive to feature scaling.
3. The SVM and LogisticRegression models give us the highest accuracy.

The following article explains the reasons behind this very clearly:

[feature-scaling-machine-learning-normalization-standardization](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/)
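
Accuracy from a single 70:30 split can vary noticeably with the random seed, so a cross-validated estimate is a useful complement. The sketch below is an assumption-laden illustration, not the workflow used in this notebook: it loads scikit-learn's bundled copy of the Wisconsin diagnostic data via `load_breast_cancer` (all 30 features, and with the target encoded as 0 = malignant, 1 = benign, the opposite of the mapping used here) and wraps the scaler and model in a pipeline so the scaler is refitted on each training fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled breast cancer dataset (an assumption,
# used here so the example runs without the Kaggle CSV)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The pipeline fits the scaler on each training fold only, avoiding leakage
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# 5-fold cross-validated accuracy
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

The same pattern should work with `X` and `y` from this notebook and with any of the classifiers defined above in place of `LogisticRegression`.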

In [None]:
cm = confusion_matrix(y_test, lr.predict(X_test_std))
sns.heatmap(cm, annot=True, fmt="d")
plt.show()

In [None]:
cm = confusion_matrix(y_test, svc.predict(X_test_std))
sns.heatmap(cm, annot=True, fmt="d")
plt.show()

From the confusion matrices above, we can observe that the svc and lr models each misclassify only one test sample.
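
For a cancer diagnosis task, the kind of error matters as well as the count: a false negative (a malignant case predicted as benign) is usually more costly than a false positive. As a sketch, the four cells of a binary confusion matrix can be unpacked and turned into recall and precision. The arrays `y_true` and `y_hat` below are made-up labels for illustration, not results from the models above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels (1 = malignant, 0 = benign)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Recall: share of malignant cases that were detected
recall = tp / (tp + fn)
# Precision: share of malignant predictions that were correct
precision = tp / (tp + fp)

# Cross-check against the scikit-learn helpers
assert np.isclose(recall, recall_score(y_true, y_hat))
assert np.isclose(precision, precision_score(y_true, y_hat))
print(tn, fp, fn, tp, recall, precision)
```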