In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# ABOUT DATASET

Any tumour (abnormal growth of cells) in the human body can be broadly classified into two types: **Benign** (non-cancerous cell growth) and **Malignant** (cancerous cell growth). This dataset is a collection of records from patients who were found to have a tumour, each classified as either "Benign" or "Malignant" on the basis of features specific to the cell growth, such as the radius of the cells, the surface area of the growth, etc.

# ABOUT NOTEBOOK

In this notebook I have tried to build a model that predicts the nature of a tumour as accurately as possible. For this, I will be comparing the accuracy of different classification algorithms and then performing hyper-parameter tuning to find the set of parameters that gives each model its best accuracy.

# IMPORTING THE LIBRARIES AND DATASET

### LIBRARIES

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

### DATASET

In [None]:
dataset = pd.read_csv ('../input/breast-cancer-wisconsin-data/data.csv')

# EXPLORATORY ANALYSIS OF DATASET

In this section, we will be performing some basic operations on the dataset in order to analyse the data as a whole. For example, we will be checking the size of the dataset, what the different features are, and what the data types of the features are, to name a few.

In [None]:
dataset.shape

In [None]:
dataset.columns

In [None]:
dataset.info ()

In [None]:
dataset.head ()

### REMOVING UNNEEDED FEATURES

The columns **id** and **Unnamed: 32** don't play any role in the prediction and hence we can drop them from the dataset.

In [None]:
dataset.drop ('id', axis = 1, inplace = True)
dataset.drop ('Unnamed: 32', axis = 1, inplace = True)

In [None]:
dataset.shape 

In [None]:
dataset.head ()

### CHECKING FOR MISSING DATA

Next we need to check for any missing data that might be present in the dataset. For this, we will be using the **isna ()** function of the **Pandas** library and summing its result per column, which gives the number of missing entries in each feature.

In [None]:
dataset.isna ().sum ()

Since the count returned for every column is **0**, we can conclude that there is no missing data in the dataset.

### CHECKING THE NUMBER OF UNIQUE VALUES

Next we will be checking how many unique values each feature has, in order to get a better understanding of the dataset we are working on.

In [None]:
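# Count the distinct values in each column (avoids shadowing the built-in name `dict`)
unique_counts = {}
for column in dataset.columns:
    unique_counts[column] = dataset[column].nunique()

pd.DataFrame(unique_counts, index=["unique count"]).transpose()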

From the result of the above function we can see that we have only 1 categorical data feature and the rest are continuous data features. 

### ENCODING THE CATEGORICAL VARIABLE

To ensure that the entire dataset is in numerical form, we will be encoding the categorical variable **DIAGNOSIS** and converting it into 0s and 1s.

For this, we will be making use of the **LabelEncoder** class from the **Preprocessing** module of the **Sklearn** library.

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
dataset.diagnosis = labelencoder_Y.fit_transform(dataset.diagnosis)
dataset.head (10)

From the above table, it is clearly visible that the **DIAGNOSIS** feature is taking 0s and 1s as values.

0 --> **Benign**

1 --> **Malignant**

# BASIC VISUALISATION OF DATASET

After doing a theoretical analysis in the previous section, we will be moving on to visual analysis of the dataset. This will include a number of pair plots and a heat map between the different features and how they affect each other and how they will affect our algorithm. We will also be getting a general idea about which features will play a more active role while determining the accuracy of the model.

### CORRELATION MATRIX AND HEATMAP

The **Correlation Matrix**, as the name suggests, is a matrix which shows us how strongly each pair of feature variables in the dataset is correlated.

We use the **Heatmap** as a visually pleasing way to show the relationships between features.
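To make the numbers in the matrix concrete, here is a small plain-Python sketch of the Pearson correlation coefficient, which is the default measure used by the **corr ()** function of Pandas. The helper name `pearson` and the sample values are assumptions made for this illustration only.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values, only to show the two extremes of the coefficient.
radius = [1.0, 2.0, 3.0, 4.0]
perimeter = [2.0, 4.0, 6.0, 8.0]     # grows in exact proportion to radius
smoothness = [4.0, 3.0, 2.0, 1.0]    # falls as radius grows

r_pos = pearson(radius, perimeter)   # perfect positive correlation
r_neg = pearson(radius, smoothness)  # perfect negative correlation
print(r_pos, r_neg)
```

A value close to 1 or -1 means the two features move together almost linearly, while a value close to 0 means there is little linear relationship between them.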

In [None]:
df = pd.DataFrame (dataset, columns = ['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'])
df.corr ()

In [None]:
corr_Matrix = df.corr ()
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap (corr_Matrix, linewidths = 0.5, annot = True, fmt= '.1f',ax=ax)
plt.show ()

From the above heatmap we can infer a few things :-

1. The **"_se"** features are only weakly correlated (ranging from 0.0 to 0.5) with the response variable **DIAGNOSIS**, so we won't be considering them for our final working dataset.

2. The **"_mean"** and **"_worst"** features, apart from being strongly correlated with the response variable **DIAGNOSIS**, are also very strongly correlated with their counterparts. For example, **"radius_mean"** has a correlation of 1.0 (rounded to one decimal place in the heatmap) with **"radius_worst"**. This implies that we don't need to consider both the **"_mean"** and **"_worst"** features together and we can make use of either one of the sets. For this notebook, we will be making use of the **"_mean"** features.

From the **"_mean"** features we will be selecting only those which have a correlation of 0.5 and above with the response variable **DIAGNOSIS**.

### FORMING FINAL WORKING DATASET

Now that we have identified the key features that will play the major role while making predictions, we are going to drop the rest of the features from the dataset.

In [None]:
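# Columns 1-10 are the "_mean" features; columns 11-30 are the "_se" and "_worst" features.
# Drop every non-"_mean" feature and every "_mean" feature whose correlation with diagnosis is below 0.5.
label = []
for i in range (30):
    # .iloc makes the positional lookup explicit (row 0 of the matrix is diagnosis itself)
    if corr_Matrix['diagnosis'].iloc[i + 1] < 0.5 or i >= 10:
        label.append (dataset.columns.values[i + 1])
dataset.drop (labels = label, axis = 1, inplace = True)
dataset.head ()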

The dataset now contains only those features which will be playing an important role in classifying the type of the tumour, and thus we will be using only these to train (and test) our model.

### PAIRPLOT VISUALS

Next is a visual representation of how the remaining features apart from the response feature **DIAGNOSIS** are related to each other.

In [None]:
sns.pairplot(dataset, hue = "diagnosis")
plt.show()

In the above plots, **0** corresponds to **BENIGN** and **1** corresponds to **MALIGNANT**.

The visuals show a trend that has been followed throughout each plot. At lower values of the features, the diagnosis is predominantly **BENIGN** and at higher values, **MALIGNANT** has been the chief diagnosis. 


### COUNTPLOT VISUAL

The count plot will give us a clearer picture of the actual number of data points for each diagnosis.

In the plot below, **0** corresponds to **BENIGN** and **1** corresponds to **MALIGNANT**.

In [None]:
sns.countplot (x = 'diagnosis',data = dataset)
plt.show ()

From the count plot, it can be easily inferred that there are more **BENIGN** diagnosed data points than **MALIGNANT** diagnosed data points.

# DATA PREPROCESSING

This section involves transforming the raw data that we have into a more understandable format for the algorithm to process. This is done so that the data which we will be feeding into the algorithm is not garbage and we don't get false predictions in return. This includes techniques like scaling of features, encoding of data, splitting the data into training and test sets, etc.

### SPLITTING DATASET INTO DEPENDENT AND INDEPENDENT VARIABLES

Finally, we will be splitting the updated dataset into two parts. The first is the collection of independent variables and is called the **MATRIX OF FEATURES**. The other is the dependent variable and is known as the **RESPONSE FEATURE**.

In [None]:
# X = Matrix of Features
# Y = Response Feature

X = dataset.iloc [:, 1:].values
Y = dataset.iloc [:, 0].values
X.shape

### SPLITTING THE MATRIX OF FEATURES AND RESPONSE FEATURE INTO TRAINING AND TEST SETS

As the names suggest, the model algorithms are trained using the **TRAINING SET** and then the model algorithms apply their learnings from training onto the **TEST SET** to get the predicted values which are then compared to the actual values to get the accuracy.
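As a rough illustration of what such a split does, the sketch below shuffles the row indices with a fixed seed and sets aside a quarter of them for testing. It is a plain-Python approximation under stated assumptions: `split_indices` is a hypothetical helper, the dataset size of 20 is made up, and the rows it selects will not match those chosen by **train_test_split**.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=1):
    """Shuffle the row indices and return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # fixed seed keeps the split reproducible
    n_test = math.ceil(n_samples * test_size)   # the test share is rounded up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(20)         # 20 is a hypothetical dataset size
print(len(train_idx), len(test_idx))            # 15 5
```

Fixing the seed plays the same role as `random_state` in the cell below: rerunning the notebook gives the same training and test rows every time.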


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split (X, Y, test_size = 0.25, random_state = 1)

In [None]:
X_train.shape

In [None]:
Y_train.shape

In [None]:
X_test.shape

In [None]:
Y_test.shape

### FEATURE SCALING

In any dataset, features measured on a large scale can dominate those measured on a small scale while the model is being trained. We don't want that. We want all features to have a more or less equal say in the prediction. This matters most in algorithms where Euclidean Distance is measured, because a feature that is big in scale compared to the others would otherwise dominate the distance.

For this we'll be using one of the most widely used feature scaling methods there is, **STANDARD SCALER**. This method subtracts the mean of each feature and divides by its standard deviation, so that each feature becomes centred around **0** with a standard deviation of **1**. It does not require the data to be normally distributed, although it behaves best when the features are roughly bell-shaped and free of extreme outliers. The scaler is fitted on the training set only and then applied to the test set, so that no information from the test set leaks into training.
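The arithmetic behind standard scaling can be sketched in a few lines of plain Python. The helper names `fit_scaler` and `transform` and the sample values are assumptions for this illustration; the point is that the mean and standard deviation are learned from the training values and then reused unchanged on the test values.

```python
import statistics

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Apply z = (x - mean) / std with statistics learned from the training data."""
    return [(x - mean) / std for x in column]

# Made-up values standing in for one feature.
train_col = [10.0, 12.0, 14.0, 16.0, 18.0]
test_col = [11.0, 20.0]

mean, std = fit_scaler(train_col)
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)   # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1, while the test values generally do not, since they are measured against the training statistics.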

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler ()
X_train = sc.fit_transform (X_train)
X_test = sc.transform (X_test)

In [None]:
print (X_train [:5, :])

In [None]:
print (X_test [:5, :])

# MODEL IMPLEMENTATIONS

In this section we will be building a model using the different classification algorithms that we have like K-NN, Logistic Regression, Naive Bayes, etc. and calculating the model accuracy for each algorithm to see which would be best suited for our dataset.

Also, we will be optimizing our model using methods which could result in a new classification algorithm having the best accuracy.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
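The two metrics imported above can also be worked out by hand, which helps when reading the results. The sketch below is a plain-Python illustration with made-up labels; `confusion_counts` and `accuracy` are hypothetical helpers, and the matrix layout (rows are actual classes, columns are predicted classes) follows the convention used by Sklearn.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

def accuracy(matrix):
    """Share of predictions on the diagonal, i.e. the correct ones."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels: 0 = benign, 1 = malignant.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 1, 0, 1, 0, 1]

cm = confusion_counts(y_true, y_pred)
acc = accuracy(cm)
print(cm)    # [[3, 1], [1, 3]]
print(acc)   # 0.75
```

For a diagnosis task the off-diagonal cells deserve a look as well: a malignant tumour predicted as benign (bottom-left cell) is usually the costlier mistake, and accuracy alone does not distinguish it from the opposite error.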

### LOGISTIC REGRESSION MODEL

In [None]:
from sklearn.linear_model import LogisticRegression 
classifier_log = LogisticRegression ()
classifier_log.fit (X_train, Y_train)
Y_pred_log = classifier_log.predict (X_test)
cm_log = confusion_matrix (Y_test, Y_pred_log)
acc_log = accuracy_score (Y_test, Y_pred_log)

### K-NN MODEL

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier_knn = KNeighborsClassifier ()
classifier_knn.fit (X_train, Y_train)
Y_pred_knn = classifier_knn.predict (X_test)
cm_knn = confusion_matrix (Y_test, Y_pred_knn)
acc_knn = accuracy_score (Y_test, Y_pred_knn)

### NAIVE BAYES MODEL

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier_nb = GaussianNB ()
classifier_nb.fit (X_train, Y_train)
Y_pred_nb = classifier_nb.predict (X_test)
cm_nb = confusion_matrix (Y_test, Y_pred_nb)
acc_nb = accuracy_score (Y_test, Y_pred_nb)

### SVM MODEL

In [None]:
from sklearn.svm import SVC
classifier_svm = SVC (kernel = 'rbf', random_state = 0)
classifier_svm.fit (X_train, Y_train)
Y_pred_svm = classifier_svm.predict (X_test)
cm_svm = confusion_matrix (Y_test, Y_pred_svm)
acc_svm = accuracy_score (Y_test, Y_pred_svm)

### DECISION TREE MODEL

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier_dtc = DecisionTreeClassifier (criterion = 'entropy', random_state = 0)
classifier_dtc.fit (X_train, Y_train)
Y_pred_dtc = classifier_dtc.predict (X_test)
cm_dtc = confusion_matrix (Y_test, Y_pred_dtc)
acc_dtc = accuracy_score (Y_test, Y_pred_dtc)

### RANDOM FOREST MODEL

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier_rfc = RandomForestClassifier (n_estimators = 100, criterion = 'entropy', random_state = 1)
classifier_rfc.fit (X_train, Y_train)
Y_pred_rfc = classifier_rfc.predict (X_test)
cm_rfc = confusion_matrix (Y_test, Y_pred_rfc)
acc_rfc = accuracy_score (Y_test, Y_pred_rfc)

### ACCURACY COMPARISON

In [None]:
prediction_columns = ["NAME OF MODEL", "ACCURACY SCORE"]
df_pred = {"NAME OF MODEL" : ["LOGISTIC REGRESSION", "K-NN", "NAIVE BAYES", "SVM", "DECISION TREE", "RANDOM FOREST"],
           "ACCURACY SCORE" : [acc_log, acc_knn, acc_nb, acc_svm, acc_dtc, acc_rfc]}
df_predictions = pd.DataFrame (df_pred, columns = prediction_columns)
df_predictions

From the table above it is fairly evident that the **SUPPORT VECTOR MACHINE** has the highest accuracy score of **0.923077 (92.31%)** for our dataset.

# HYPER-PARAMETER TUNING

We will now try to tune our models and see whether it is possible to achieve any increase in the accuracy scores by changing the parameter values. The technique that we will be using, grid search with cross-validation, tries every combination of the candidate parameter values and reports the combination with the highest mean cross-validated accuracy. There is also the possibility that a different model is found to have the highest accuracy after the parameter tuning is done.
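To get a feel for how much work a grid search involves, the sketch below enumerates every combination of the K-NN grid used later in this notebook with plain Python. It only counts the candidates and does not train anything; the variable names are chosen for this illustration.

```python
from itertools import product

# The K-NN grid from this notebook, written as a plain dictionary.
param_grid = {
    "n_neighbors": [3, 5, 7, 10, 13, 15],
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}

# Every combination of one value per parameter is one candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 10                                # matches cv = 10 in the cells below
total_fits = len(candidates) * n_folds

print(len(candidates))   # 24
print(total_fits)        # 240
```

The count grows multiplicatively with every parameter that is added, which is why the larger grids in the cells below take noticeably longer to run.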

In [None]:
from sklearn.model_selection import GridSearchCV
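Each candidate is scored with 10-fold cross-validation (`cv = 10` in the cells that follow): the training set is cut into ten parts, and each part takes one turn as the validation set while the other nine are used for fitting. The sketch below shows the index bookkeeping for plain, unshuffled folds. It is a simplification under stated assumptions: `k_fold_indices` is a hypothetical helper, the sample count of 23 is made up, and for classifiers Sklearn additionally stratifies the folds by class, which is not reproduced here.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists for consecutive, unshuffled folds."""
    # Spread the remainder over the first folds so that sizes differ by at most one.
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=23, n_folds=10))   # 23 is a hypothetical size
for train, validation in folds[:2]:
    print(len(train), validation)
```

The `best_score_` reported by the grid search is the mean of the ten validation accuracies of the winning candidate, so it is computed on the training data only.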

### LOGISTIC REGRESSION MODEL

In [None]:
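# The 'l1' penalty is only supported by the 'liblinear' and 'saga' solvers, so it gets its own grid
C_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
parameters = [{'penalty': ['l2'], 'C': C_values, 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']},
              {'penalty': ['l1'], 'C': C_values, 'solver': ['liblinear', 'saga']}]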
grid_search = GridSearchCV(estimator = classifier_log,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_log = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_accuracy_log)
print(best_parameters)

### K-NN MODEL

In [None]:
parameters = [{'n_neighbors': [3,5,7,10,13,15], 'weights': ['uniform', 'distance'],
                'p': [1,2]}]
grid_search = GridSearchCV(estimator = classifier_knn,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_knn = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_accuracy_knn)
print(best_parameters)

### NAIVE BAYES MODEL

The Gaussian Naive Bayes model has very little to tune (its main hyper-parameter is **var_smoothing**), so we will not be performing a grid search for it.

### SVM MODEL

In [None]:
parameters = [{'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'kernel': ['linear', 'rbf'],
                'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator = classifier_svm,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_svm = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_accuracy_svm)
print(best_parameters)

### DECISION TREE MODEL

In [None]:
parameters = [{'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150], 
                'max_leaf_nodes': [2,4,6,10,15,30,40,50,100], 'min_samples_split': [2, 3, 4]}]
grid_search = GridSearchCV(estimator = classifier_dtc,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_dtc = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_accuracy_dtc)
print(best_parameters)

### RANDOM FOREST MODEL

In [None]:
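# 'auto' was an alias for 'sqrt' in classifiers and has been removed from recent Sklearn releases;
# "no depth limit" is written as None, not as the string 'none'
parameters = [{'n_estimators': [100,200,300],
               'max_features': ['sqrt'],
               'max_depth': [10,25,50,None],
               'min_samples_leaf': [1, 2],
               'min_samples_split': [2, 5]}]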
grid_search = GridSearchCV(estimator = classifier_rfc,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_rfc = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_accuracy_rfc)
print(best_parameters)

# FINAL ACCURACIES AFTER HYPER-PARAMETER TUNING

In [None]:
prediction_columns = ["NAME OF MODEL", "ACCURACY SCORE", "BEST ACCURACY (AFTER HYPER-PARAMETER TUNING)"]
df_pred = {"NAME OF MODEL" : ["LOGISTIC REGRESSION", "K-NN", "NAIVE BAYES", "SVM", "DECISION TREE", "RANDOM FOREST"],
           "ACCURACY SCORE" : [acc_log, acc_knn, acc_nb, acc_svm, acc_dtc, acc_rfc],
           "BEST ACCURACY (AFTER HYPER-PARAMETER TUNING)" : [best_accuracy_log, best_accuracy_knn, "-", best_accuracy_svm, best_accuracy_dtc, best_accuracy_rfc]}
df_predictions = pd.DataFrame (df_pred, columns = prediction_columns)
df_predictions

# CONCLUSION

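To conclude this notebook, it is fairly evident that it is the **SUPPORT VECTOR MACHINE** model that has come out on top, with the highest accuracy both before and after the hyper-parameter tuning. It ended up with an accuracy of **0.923077 (92.31%)** on the test set before hyper-parameter tuning and a best cross-validated accuracy of **0.936434 (93.64%)** on the training set after, and is hence the best-suited model among those compared for the given dataset. Since the first figure is measured on the test set and the second is a mean over cross-validation folds, the two are not directly comparable.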