<h1 style='font-size:40px;background:PINK; border:0; color:white'><center> BREAST CANCER PREDICTION USING SVM</center></h1>

<center><img src="https://nationaltoday.com/wp-content/uploads/2019/10/breast-cancer-aware.jpg"></center>

<h2 style='text-align:center;font-size:30px;background-color:black;border:20px;color:white'>TABLE OF CONTENTS</h2>

<a id="10"></a>

* [IMPORTING LIBRARIES](#1)
* [MISSING VALUES](#2)
* [EXPLORATORY DATA ANALYSIS](#3)
* [FEATURE SCALING](#4)
* [MODEL BUILDING](#5)
* [MODEL EVALUATION](#6)
* [PARAMETER TUNING](#7)
* [FEATURE IMPORTANCE](#8)

<a id="1"></a>
<h2 style='font-size:30px;background:black; border:0; color:white'><center> IMPORTING LIBRARIES </center></h2>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

%matplotlib inline

In [None]:
pd.set_option('display.max_columns',40)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

In [None]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff

In [None]:
cancer = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
cancer.head()

In [None]:
cancer.columns

In [None]:
print("Cancer dataset dimensions : {}".format(cancer.shape))
print()
print("Rows:",cancer.shape[0])
print()
print("Columns:",cancer.shape[1])

In [None]:
cancer.describe().T

<font size='3'>The columns Unnamed: 32 and id don't make sense to keep for prediction, so we can drop them!</font>

In [None]:
# Pass axis by keyword: the positional form is no longer accepted by recent pandas versions
cancer.drop(['Unnamed: 32','id'],axis=1,inplace=True)

In [None]:
cancer.head()

<a id="2"></a>
<h2 style='font-size:30px;background:black; border:0; color:white'><center> MISSING VALUES </center></h2>

In [None]:
cancer.isnull().any().any()

<font size='3'>Hurray! We have no missing records</font>

<a id="3"></a>
<h2 style='font-size:30px;background:black; border:0; color:white'><center>EXPLORATORY DATA ANALYSIS</center></h2>

In [None]:
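# Take the labels from the counts themselves so they can never be mismatched with the values
counts = cancer['diagnosis'].value_counts()
trace = go.Pie(labels = list(counts.index.map({'B':'benign','M':'malignant'})), values = counts.values,
               textfont=dict(size=15), opacity = 0.8,
               marker=dict(colors=['pink', 'purple'],
                           line=dict(color='#000000', width=1.5)))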

layout= go.Layout(
        title={
        'text': "Distribution of diagnosis variable",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})

fig = go.Figure(data = [trace], layout=layout)
fig.show()

<font size='3'>We have more benign cases than malignant ones.</font>

In [None]:
cancer['diagnosis']= cancer['diagnosis'].map({'M':1,'B':0})

In [None]:
M = cancer[(cancer['diagnosis'] != 0)]
B = cancer[(cancer['diagnosis'] == 0)]

In [None]:
def plots(column, bin_size) :  
    temp1 = M[column]
    temp2 = B[column]
    
    hist_data = [temp1, temp2]
    
    group_labels = ['Malignant', 'Benign']
    colors = ['purple', 'pink']

    fig = ff.create_distplot(hist_data, group_labels, colors = colors, show_hist = True, bin_size = bin_size, curve_type='kde')
    
    fig['layout'].update(title = column)
    fig.show()

In [None]:
plots('radius_mean', .5)
plots('texture_mean', .5)
plots('perimeter_mean',5)
plots('area_mean',15)

In [None]:
plots('radius_se', .1)
plots('texture_se', .1)
plots('perimeter_se', .5)
plots('area_se', 5)

In [None]:
plots('radius_worst', .5)
plots('texture_worst', .5)
plots('perimeter_worst', 5)
plots('area_worst', 10)

In [None]:
plt.figure(figsize=(20,10))
sns.heatmap(cancer.corr(),annot=True)

In [None]:
sns.scatterplot(x='area_mean',y='smoothness_mean',hue='diagnosis',data=cancer)

In [None]:
cancer.columns

In [None]:
features = ['radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst']

In [None]:
# Select the 30 feature columns by name so X stays aligned with the `features` list
X = cancer[features].values
y = cancer['diagnosis']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=22,stratify=y)

<a id="4"></a>
<h2 style='font-size:30px;background:black; border:0; color:white'><center>FEATURE SCALING</center></h2>

<font size='3'>A Support Vector Machine (SVM) is trained by minimizing the norm of the weight vector w, which maximizes the margin around the decision boundary. Because that norm depends on the units of each feature, the optimal hyperplane is influenced by the scale of the inputs. It is therefore recommended to standardize the data (mean 0, variance 1) before training an SVM.</font>

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test= scaler.transform(X_test)
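
To make the scaling step concrete, here is a minimal sketch that checks what `StandardScaler` actually produces and then wraps the scaler and the SVM in one pipeline. It assumes scikit-learn's bundled copy of the Wisconsin dataset (`load_breast_cancer`) as a stand-in for `data.csv`; the `*_demo` names are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for data.csv.
# It encodes malignant as 0 and benign as 1, the reverse of this notebook's mapping.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=22, stratify=y_demo)

# Fit the scaler on the training split only, then apply it to both splits.
demo_scaler = StandardScaler().fit(Xtr_demo)
Xtr_scaled = demo_scaler.transform(Xtr_demo)
Xte_scaled = demo_scaler.transform(Xte_demo)

train_means = Xtr_scaled.mean(axis=0)
train_stds = Xtr_scaled.std(axis=0)
print("Largest |mean| on train:", np.abs(train_means).max())
print("Std range on train:", train_stds.min(), train_stds.max())
# The test split reuses the training statistics, so its means are only close to 0.
print("Largest |mean| on test:", np.abs(Xte_scaled.mean(axis=0)).max())

# A pipeline keeps scaling and the classifier together as a single estimator.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='linear'))
scaled_svm.fit(Xtr_demo, ytr_demo)
demo_accuracy = scaled_svm.score(Xte_demo, yte_demo)
print("Pipeline test accuracy:", demo_accuracy)
```

Fitting the scaler on the training split alone, as the notebook does, keeps information from the test set out of the model.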

<a id="5"></a>
<h2 style='font-size:30px;background:black; border:0; color:white'><center>MODEL BUILDING</center></h2>

In [None]:
model = SVC(kernel='linear')
model.fit(X_train,y_train)

In [None]:
y_pred = model.predict(X_test)

<a id="6"></a>
<h2 style='font-size:30px;background:black; border:0; color:white'><center>MODEL EVALUATION</center></h2>

In [None]:
cnf = confusion_matrix(y_test,y_pred)
sns.heatmap(cnf,annot=True,cmap='summer',fmt='g')

In [None]:
acc = accuracy_score(y_test,y_pred)
print("Accuracy:",acc)

In [None]:
print(classification_report(y_test,y_pred))
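
For a diagnosis task, accuracy alone does not show which kind of error the model makes. The sketch below reads sensitivity and specificity directly off a confusion matrix. The label arrays are made up for illustration and are not the notebook's results; the `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only: 1 = malignant, 0 = benign.
toy_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
toy_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

# Rows are true classes and columns are predicted classes,
# so for binary labels ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

sensitivity = tp / (tp + fn)   # share of malignant cases that were caught
specificity = tn / (tn + fp)   # share of benign cases correctly cleared
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Accuracy:", accuracy)
```

The same four counts can be read from the heatmap of `cnf`; a missed malignant case (false negative) is usually the costlier error here.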

We already have a good accuracy of 98%, but let's see whether parameter tuning can make the model even better. Let's go!

<a id="7"></a>
<h2 style='font-size:30px;background:black; border:0; color:white'><center>PARAMETER TUNING</center></h2>

In [None]:
param_grid={'C':[0.1,1,10,100,1000],
            'gamma':[1,0.1,0.01,0.001,0.0001],
            'kernel':['rbf']}

In [None]:
grid= GridSearchCV(SVC(),param_grid,refit=True,verbose=4)
grid.fit(X_train,y_train)

In [None]:
grid.best_params_

In [None]:
grid.best_score_

In [None]:
g_pred = grid.predict(X_test)

In [None]:
g_cnf = confusion_matrix(y_test,g_pred)
sns.heatmap(g_cnf,annot=True,fmt='g',cmap='Blues')

In [None]:
g_acc = accuracy_score(y_test,g_pred)
print("Accuracy with GridSearch:",g_acc)

In [None]:
print(classification_report(y_test,g_pred))

The linear model without parameter tuning performed best, with 98% accuracy!
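
The grid above only searches the RBF kernel, so the linear and tuned models are compared on the test set rather than by cross-validation. One possible way to compare them on equal footing is to put both kernels into the same search. This is a sketch under assumptions, not the notebook's method: it uses scikit-learn's bundled copy of the dataset, a smaller grid, and hypothetical names (`svm_pipe`, `kernel_grid`, `kernel_search`).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for data.csv (malignant = 0, benign = 1 there).
X_demo, y_demo = load_breast_cancer(return_X_y=True)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=22, stratify=y_demo)

# Scaling inside the pipeline means each CV fold is scaled with its own training part.
svm_pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

# A list of dicts lets each kernel have its own parameters (gamma is unused by 'linear').
kernel_grid = [
    {'svc__kernel': ['linear'], 'svc__C': [0.1, 1, 10]},
    {'svc__kernel': ['rbf'], 'svc__C': [0.1, 1, 10, 100],
     'svc__gamma': [0.1, 0.01, 0.001]},
]

kernel_search = GridSearchCV(svm_pipe, kernel_grid, cv=5)
kernel_search.fit(Xtr_demo, ytr_demo)

print("Best parameters:", kernel_search.best_params_)
print("Best CV accuracy:", kernel_search.best_score_)
print("Test accuracy of best model:", kernel_search.score(Xte_demo, yte_demo))
```

With this setup the kernel choice is made on the training data only, and the test set is used once at the end.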

<a id="8"></a>
<h2 style='font-size:30px;background:black; border:0; color:white'><center>FEATURE IMPORTANCE</center></h2>

In [None]:
coef= model.coef_
coeffs = np.squeeze(coef)
coeffs

In [None]:
coefs = pd.DataFrame({"Features":features,"Coefficients":coeffs})
feature_imp = coefs.sort_values(by='Coefficients',ascending=False)

In [None]:
feature_imp

In [None]:
plt.figure(figsize=(15,10))
sns.barplot(y='Features',x='Coefficients',data=feature_imp)
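
Sorting by the signed coefficient puts strongly negative features at the bottom, even though they influence the linear SVM's decision as much as strongly positive ones. A minimal sketch of ranking by absolute value instead is shown below; the feature names and coefficient values are invented for illustration, and `AbsCoefficient` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical coefficients for illustration only, not taken from the fitted model.
toy_coefs = pd.DataFrame({
    "Features": ["feature_a", "feature_b", "feature_c", "feature_d"],
    "Coefficients": [0.9, -1.4, 0.1, -0.3],
})

# The magnitude measures influence; the sign only says which class the feature pushes towards.
toy_coefs["AbsCoefficient"] = toy_coefs["Coefficients"].abs()
ranked = toy_coefs.sort_values(by="AbsCoefficient", ascending=False).reset_index(drop=True)
print(ranked)
```

Comparing magnitudes across features is only meaningful because the inputs were standardized before the model was fitted.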

## [GO TO TOP](#10)

<center><img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcQpTjNIwP4--vaPwInGxqMufHlWLQjCnRPLEg&usqp=CAU"></center>

<center><img src="https://images-na.ssl-images-amazon.com/images/I/613f9N0BiJL._SL1500_.jpg"></center>