**Prediction of Breast Cancer: Benign or Malignant**

The goal is to predict whether a breast cancer diagnosis is benign or malignant based on several observations/features.
30 features are used, for example:

  - radius (mean of distances from center to points on the perimeter)
  - texture (standard deviation of gray-scale values)
  - perimeter
  - area
  - smoothness (local variation in radius lengths)
  - compactness (perimeter^2 / area - 1.0)
  - concavity (severity of concave portions of the contour)
  - concave points (number of concave portions of the contour)
  - symmetry 
  - fractal dimension ("coastline approximation" - 1)

The two classes are linearly separable using all 30 input features.

Number of Instances: 569
Class Distribution: 212 Malignant, 357 Benign
Target class:
   - Malignant
   - Benign
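
As a quick sanity check of these numbers, the same Wisconsin Diagnostic data also ships with scikit-learn as `load_breast_cancer`. The sketch below assumes scikit-learn is installed and uses that bundled copy rather than the Kaggle CSV loaded later. Note that scikit-learn encodes the target the other way round from this notebook (0 = malignant, 1 = benign).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Bundled copy of the Wisconsin Diagnostic dataset (not the Kaggle CSV)
data = load_breast_cancer()

n_samples, n_features = data.data.shape

# In scikit-learn's copy, target 0 is 'malignant' and target 1 is 'benign'
counts = {
    str(name): int(n)
    for name, n in zip(data.target_names, np.bincount(data.target))
}

print(n_samples, n_features)
print(counts)
```

The counts should agree with the instance count and class distribution listed above.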

<img src="https://image.ibb.co/kxSX49/c1PNG.png" alt="class">

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
import matplotlib.pyplot as plt # Import matplotlib for data visualisation
import seaborn as sns
import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
df_cancer = pd.read_csv("../input/breast-cancer.csv")

In [None]:
df_cancer.head()

**Mapping the target variable `diagnosis` to 0/1: 1 for Malignant, 0 for Benign**
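
As a small illustration of this kind of mapping, here is a sketch on a toy Series (made-up labels, not the real data). It also shows why it can be worth checking the result: any label that is missing from the dictionary silently becomes NaN.

```python
import pandas as pd

# Toy labels, not the real dataset
labels = pd.Series(['M', 'B', 'B', 'M'])
encoded = labels.map({'M': 1, 'B': 0})
print(encoded.tolist())

# A label missing from the dictionary ('X' here) becomes NaN
unexpected = pd.Series(['M', 'X']).map({'M': 1, 'B': 0})
n_unmapped = int(unexpected.isna().sum())
print(n_unmapped)
```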

In [None]:
# Encode the target: malignant -> 1, benign -> 0
df_cancer['diagnosis'] = df_cancer['diagnosis'].map({'M': 1, 'B': 0})

In [None]:
df_cancer.head()

**Looking at the tail (the last 5 entries) to see what the end of the dataset looks like**

In [None]:
df_cancer.tail()

In [None]:
df_cancer.shape

**This means there are 569 rows and 33 columns**

**Checking if there are any nulls in any column**

In [None]:
df_cancer.isnull().sum()

**As we have seen, there are no null entries except in `Unnamed: 32`, so we will drop that column before training**
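
A minimal sketch of this check on a toy frame (the values are made up, and the `Unnamed: 32` column only mimics the real one). It uses `dropna(axis=1, how='all')`, which is one alternative to dropping the column by name: it removes only columns in which every value is missing.

```python
import numpy as np
import pandas as pd

# Toy frame with an all-NaN column, mimicking 'Unnamed: 32' (values are made up)
toy = pd.DataFrame({
    'radius_mean': [17.9, 20.5, 11.4],
    'texture_mean': [10.3, 17.7, 21.2],
    'Unnamed: 32': [np.nan, np.nan, np.nan],
})

# Number of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Drop only the columns in which every value is missing
cleaned = toy.dropna(axis=1, how='all')
print(cleaned.columns.tolist())
```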

Next, we visualize some important features against the target variable `diagnosis` to see which features it is most strongly related to.

In [None]:
sns.pairplot(df_cancer, hue = 'diagnosis', vars = ['radius_mean', 'texture_mean', 'area_mean', 'perimeter_mean', 'smoothness_mean'] )

In [None]:
# Number of samples per diagnosis class
sns.countplot(x='diagnosis', data=df_cancer)

In [None]:
sns.scatterplot(x = 'area_mean', y = 'smoothness_mean', hue = 'diagnosis', data = df_cancer)

In [None]:
sns.lmplot(x='area_mean', y='smoothness_mean', hue='diagnosis', data=df_cancer, fit_reg=False)

In [None]:
fig = sns.FacetGrid(df_cancer, hue="diagnosis", aspect=4)

# Plot one kernel density estimate of 'smoothness_mean' per diagnosis class
fig.map(sns.kdeplot, 'smoothness_mean', fill=True)

# Use the largest observed value as the upper x limit
max_value = df_cancer['smoothness_mean'].max()

# The feature cannot be negative, so set the lower x limit to 0
fig.set(xlim=(0, max_value))

# Finally add a legend
fig.add_legend()

In [None]:
fig = sns.FacetGrid(df_cancer, hue="diagnosis", aspect=4)

# Plot one kernel density estimate of 'texture_mean' per diagnosis class
fig.map(sns.kdeplot, 'texture_mean', fill=True)

# Use the largest observed value as the upper x limit
max_value = df_cancer['texture_mean'].max()

# The feature cannot be negative, so set the lower x limit to 0
fig.set(xlim=(0, max_value))

# Finally add a legend
fig.add_legend()

In [None]:
fig = sns.FacetGrid(df_cancer, hue="diagnosis", aspect=4)

# Plot one kernel density estimate of 'area_mean' per diagnosis class
fig.map(sns.kdeplot, 'area_mean', fill=True)

# Use the largest observed value as the upper x limit
max_value = df_cancer['area_mean'].max()

# The feature cannot be negative, so set the lower x limit to 0
fig.set(xlim=(0, max_value))

# Finally add a legend
fig.add_legend()

In [None]:
# catplot with kind='point' is the current name for the old factorplot
sns.catplot(x='texture_mean', y='area_mean', hue='diagnosis', data=df_cancer, kind='point')

In [None]:
sns.scatterplot(x='concavity_se', y='radius_mean', hue='diagnosis', data=df_cancer)


In [None]:
sns.scatterplot(x='compactness_se', y='radius_mean', hue='diagnosis', data=df_cancer)

**Checking the correlation among different features and target variable diagnosis**

Visualization is one way to explore the data. If you want to see numbers and statistics instead, there are other ways to find out how the data correlates.

Pearson’s Correlation Coefficient measures the strength of the linear association between two variables. Its value lies between -1 and +1.

+1 means that the two variables are perfectly positively correlated and 0 means no linear correlation. -1 means a perfect negative correlation: as one variable increases, the other decreases.

A t-test can be used to check whether a correlation coefficient differs significantly from zero.

Other popular correlation coefficients include

  - **Spearman’s rank-order correlation**
  - **Kendall’s rank correlation**

Correlation matters most when you have a dataset with many features. It’s tempting to think that a larger number of features will always help a model make better predictions, but that is not necessarily true.

If you train a model on a set of features that have little or no correlation with the target, you are likely to get inaccurate results.
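
To make the difference between Pearson and Spearman concrete, here is a small sketch on toy data (not the cancer dataset). The variable `y` grows monotonically with `x` but not linearly, so the rank-based Spearman coefficient is 1 while Pearson stays below 1. Spearman is computed here as the Pearson correlation of the ranks, which is its definition.

```python
import numpy as np
import pandas as pd

# Toy data: y grows monotonically with x, but not linearly
x = pd.Series(np.arange(1, 11), dtype=float)
y = x ** 3

# Pearson: linear association, computed on the raw values
pearson = np.corrcoef(x, y)[0, 1]

# Spearman: Pearson correlation computed on the ranks of the values
spearman = np.corrcoef(x.rank(), y.rank())[0, 1]

print(round(pearson, 3), round(spearman, 3))
```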

In [None]:
plt.figure(figsize=(24,12)) 
sns.heatmap(df_cancer.corr(), annot=True) 

As we have seen, `Unnamed: 32` is a column full of NaN values, and some other columns such as `id` do not contribute to cancer prediction either, so we will drop those columns before training the model.

In [None]:
unwantedcolumnlist=["diagnosis","Unnamed: 32","id"]

In [None]:
X = df_cancer.drop(unwantedcolumnlist,axis=1)

In [None]:
y = df_cancer['diagnosis']

**Now we will split the data into training and test sets using sklearn: X_train, X_test, y_train, y_test**
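
A rough sketch of what an 80/20 split produces for data of this size. The arrays below are placeholders with the same shape and class balance as the dataset, not the real features. The `stratify` argument is an optional extra that this notebook does not use; it keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the real feature matrix (569 x 30)
X_demo = np.zeros((569, 30))
# Placeholder labels with the stated class balance: 357 benign (0), 212 malignant (1)
y_demo = np.array([0] * 357 + [1] * 212)

# stratify is optional (not used in this notebook); it preserves the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=5, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With 569 rows and `test_size=0.20`, 114 rows go to the test set and 455 to the training set.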

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=5)

I believe **Support Vector Machines** are a good classification algorithm for this problem.
So let's first answer the question: <br>
**What is a Support Vector and what is an SVM?**

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems, which is what we will focus on in this notebook.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below
<img src="https://image.ibb.co/g0rmWp/svm1.png" alt="class">

Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.

**What is a hyperplane?**
 
As a simple example, for a classification task with only two features (like the image above), you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when new test data is added, the side of the hyperplane it lands on decides the class that we assign to it.

**How do we find the right hyperplane?**
 
Or, in other words, how do we best segregate the two classes within the data?

The distance between the hyperplane and the nearest data point from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.
<img src="https://image.ibb.co/mU6Lrp/svm2.png" alt="class">
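
The ideas of hyperplane, support vectors and margin can be sketched on a tiny, made-up 2-D dataset. The example below assumes a linear kernel and a very large `C` to approximate a hard-margin SVM; this is an illustration only and not the model trained in this notebook. For a linear SVM with weight vector `w`, the distance from the hyperplane to the nearest training point is `1 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data: class 0 lies on x=0, class 1 lies on x=2
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel='linear', C=1e6)
clf.fit(X_toy, y_toy)

w = clf.coef_[0]
# Distance from the separating hyperplane to the nearest training point
margin = 1.0 / np.linalg.norm(w)

print(clf.support_vectors_)
print(margin)

# New points are classified by the side of the hyperplane they fall on
new_labels = clf.predict([[0.2, 0.5], [1.8, 0.5]]).tolist()
print(new_labels)
```

Here the separating hyperplane is the vertical line halfway between the two classes, so the margin comes out close to 1.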

In [None]:
from sklearn.svm import SVC 
from sklearn.metrics import classification_report, confusion_matrix

svc_model = SVC()
svc_model.fit(X_train, y_train)

In [None]:
y_predict = svc_model.predict(X_test)
cm = confusion_matrix(y_test, y_predict)
cm

In [None]:
sns.heatmap(cm, annot=True)

In [None]:
print(classification_report(y_test, y_predict))

In [None]:
min_train = X_train.min()
min_train

In [None]:
range_train = (X_train - min_train).max()
range_train

In [None]:
X_train_scaled = (X_train - min_train)/range_train

In [None]:
X_train_scaled.head()

In [None]:
sns.scatterplot(x = X_train['area_mean'], y = X_train['smoothness_mean'], hue = y_train)

In [None]:
sns.scatterplot(x = X_train_scaled['area_mean'], y = X_train_scaled['smoothness_mean'], hue = y_train)

In [None]:
# Scale the test set with the minimum and range learned from the training set,
# so both sets go through exactly the same transformation
X_test_scaled = (X_test - min_train)/range_train
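
A common convention is to scale the test set with the minimum and range learned from the training set, so that both sets share one transformation. As a sketch, scikit-learn's `MinMaxScaler` does exactly this when it is fitted on the training data only. The arrays below are toy values, not the real dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy values, not the real dataset
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[2.0, 20.0], [6.0, 60.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and range from train only
test_scaled = scaler.transform(test)        # reuses the training min and range

# Same result as the manual formula (x - min_train) / range_train
manual = (test - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(test_scaled)
```

Test values outside the training range map outside [0, 1], which is expected and harmless for an SVM.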

In [None]:
from sklearn.svm import SVC 
from sklearn.metrics import classification_report, confusion_matrix

svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)

In [None]:
y_predict = svc_model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_predict)

sns.heatmap(cm,annot=True,fmt="d")

In [None]:
print(classification_report(y_test,y_predict))

**Improving the model using GridSearchCV** <br>
What is **Grid Search**? <br>
Grid Search is an algorithm for tuning the hyper-parameters of a model. We pass it the hyper-parameters to tune, the possible values for each hyper-parameter and a performance metric. The algorithm then places all possible hyper-parameter combinations in a grid and measures the performance of the model for each combination on a cross-validation set. Finally, it outputs the hyper-parameter combination that gives the best result.

Grid Search is generally used when you are not sure of good values for a parameter. You might have a range of parameter values that you think would work for the model and want to test which of them to use. Training the model again and again by hand with different parameters quickly becomes tedious, and that is the situation in which we use Grid Search.

Suppose that solving the optimization problem for a fixed set of values of the hyper-parameters α and β gives us a value of w. Since the optimal value of w (call it w∗) is a function of α and β, we can write it as follows, where P is the objective being minimized and S_train is the training sample:

$$w^{*}(\alpha, \beta) = \arg\min_{w} P(w, \alpha, \beta, S_{\text{train}})$$

Now we use this w∗ to predict on the validation sample to get validation error. We can view this scenario in terms of a "validation error function": the function takes as inputs the hyperparameters α and β, and returns the validation error corresponding to w∗(α,β).

So the goal of hyperparameter optimization is to find the set of values of α and β, that minimize this validation error function.

So, in Grid Search technique, it picks a bunch of values of α -- (α1,α2,…) and a bunch of values of β -- (β1,β2,…) and for each pair of values, evaluates the validation error function. Then pick the pair that gives the minimum value of the validation error function.

The pairs (α1,β1),(α1,β2),…,(α2,β1),(α2,β2),… when plotted in space look like a grid, hence the name.
<img src="https://image.ibb.co/gJFfSU/grid123.png" alt="claass" border="0">
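
The procedure described above can be sketched as an explicit loop over the grid. This is a simplified illustration of the idea, not what `GridSearchCV` does internally in every detail. It assumes scikit-learn's bundled copy of the dataset (so the block is self-contained), treats the SVC parameters `C` and `gamma` as the two hyper-parameters playing the roles of α and β, and uses mean 5-fold cross-validation accuracy as the score, so the best pair is the one with the highest score rather than the lowest error.

```python
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Bundled copy of the dataset, used only to keep the sketch self-contained
X_demo, y_demo = load_breast_cancer(return_X_y=True)

C_values = [0.1, 1, 10, 100]
gamma_values = [1, 0.1, 0.01, 0.001]

results = {}
for C, gamma in product(C_values, gamma_values):
    # Scaling sits inside the pipeline, so each fold is scaled from its own training part
    model = make_pipeline(MinMaxScaler(), SVC(kernel='rbf', C=C, gamma=gamma))
    # Score for this grid point: mean accuracy over 5 cross-validation folds
    results[(C, gamma)] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# Pick the grid point with the best cross-validation score
best_C, best_gamma = max(results, key=results.get)
best_score = results[(best_C, best_gamma)]
print(best_C, best_gamma, best_score)
```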

In [None]:
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf']} 

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=4)

In [None]:
grid.fit(X_train_scaled,y_train)

In [None]:
grid.best_params_

In [None]:
grid.best_estimator_

In [None]:
grid_predictions = grid.predict(X_test_scaled)

In [None]:
cm = confusion_matrix(y_test, grid_predictions)
sns.heatmap(cm,annot=True,fmt="d")

In [None]:
print(classification_report(y_test,grid_predictions))

That's great: with the tuned hyper-parameters the model achieved 97% accuracy and precision on the test set when classifying breast cancer as malignant or benign.
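
For reference, accuracy, precision and recall can all be read off a confusion matrix. The sketch below uses hypothetical counts (not this notebook's actual result) in scikit-learn's layout, where rows are true classes and columns are predicted classes, with malignant (1) as the positive class.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn's layout [[TN, FP], [FN, TP]];
# these counts are made up and are not this notebook's actual result
cm_demo = np.array([[50, 2],
                    [3, 45]])

tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted malignant that are malignant
recall = tp / (tp + fn)               # share of actual malignant that are detected

print(accuracy, precision, recall)
```

For a cancer screening task, recall on the malignant class deserves particular attention, because a false negative means a missed malignant case.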

**References** <br>
1. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
2. https://www.kdnuggets.com/2016/07/support-vector-machines-simple-explanation.html