![](https://mlfjqdsf5ptg.i.optimole.com/iQrIoNc-LQvF_N5U/w:800/h:400/q:69/https://nationaldaycalendar.com/wp-content/uploads/2014/10/Breast-Cancer-Awareness-Month-October-1.jpg)

# Table of Contents

  
- First look at the dataset

- EDA

   - Checking for Missing Values
   - Basic Statistical Details
   - Correlation Heatmap
      - Highly correlated pairs
      - Inversely correlated pairs
      - Low correlated pairs

- Data Visualization

   - Pair Plot
   - Scatter Plot
   - Count Plot
   - Histogram
   - Joint Plot

- Pre-Modeling Tasks

   - Separating the independent and the dependent variables
   - Splitting the dataset
   - Feature Scaling

- Modeling

   - Logistic Regression
   - Gradient Boosting Classifier
   - Random Forest Classifier
   - Decision Tree Classifier
   - KNeighbors Classifier
   - XGB Classifier
   - Support Vector Machine

- Evaluation and comparison of all the models

   - Classification Accuracy
   - Confusion matrix
   - Precision
   - Recall
   - classification_report
   - ROC AUC Score
   - Area under curve (AUC)

- Resources

# Loading the libraries and the dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Pre-Modeling Tasks

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Modeling

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC


# Evaluation and comparison of all the models


from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score, recall_score, confusion_matrix, precision_recall_fscore_support
from sklearn.metrics import roc_auc_score,auc,f1_score
from sklearn.metrics import precision_recall_curve,roc_curve

In [None]:
df = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

# Look at the dataset

Attribute Information:

- 1) ID number

- 2) Diagnosis (M = malignant, B = benign)
  

Ten real-valued features are computed for each cell nucleus:

- a) radius (mean of distances from center to points on the perimeter)
- b) texture (standard deviation of gray-scale values)
- c) perimeter
- d) area
- e) smoothness (local variation in radius lengths)
- f) compactness (perimeter^2 / area - 1.0)
- g) concavity (severity of concave portions of the contour)
- h) concave points (number of concave portions of the contour)
- i) symmetry
- j) fractal dimension ("coastline approximation" - 1)

**Check the target variable:**

- Malignant = 1 (indicates presence of cancer cells)

- Benign = 0 (indicates absence)

**What is the difference between malignant and benign?**

![Differences Between a Malignant and Benign Tumor](https://gotalktogetherdotcom.files.wordpress.com/2016/05/cancerbenignmalig1.jpg?w=550)

Loving Biology - WordPress.com

How many Benign and Malignant do we have in our dataset?

In [None]:
df['diagnosis'].value_counts()

As we can see, we have 212 malignant and 357 benign samples.
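
To put those counts in perspective, the short sketch below turns them into class shares and into the accuracy of a baseline that always predicts the majority class. It only uses the two counts reported above; the variable names are illustrative.

```python
from collections import Counter

# Counts reported by value_counts() above: 357 benign (B), 212 malignant (M).
counts = Counter({"B": 357, "M": 212})
total = sum(counts.values())

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a naive baseline that always predicts the most common class.
majority_baseline = max(counts.values()) / total

print(total)
print({label: round(share, 3) for label, share in shares.items()})
print(round(majority_baseline, 3))
```

Any model worth keeping should beat that baseline accuracy by a clear margin.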

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.shape

# Exploratory Data Analysis

In [None]:
df.info()

## Basic Statistical Details

In [None]:
# describing the dataset

df.describe().T

## Checking for missing values

Machine learning algorithms generally cannot work with missing values, so we must clean the dataset before training a model. We will also remove the columns that do not help the model, such as the `id` column.

In [None]:
df.isnull().sum()

In [None]:
# Deleting the id and Unnamed column

df= df.drop(['Unnamed: 32','id'],axis=1)

## Checking for the correlation

In [None]:
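plt.figure(figsize=(20, 10))
# diagnosis still holds the strings 'M'/'B' here, so leave it out of the correlation matrix
sns.heatmap(df.drop(columns='diagnosis').corr(), annot=True)
plt.show()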
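
Each cell of the heatmap is a Pearson correlation coefficient. As a rough sketch of what that number measures, here is the formula written out by hand on toy values (the lists below are made up for illustration and are not taken from the dataset):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy columns.
radius = [1.0, 2.0, 3.0, 4.0]
perimeter = [2.1, 3.9, 6.2, 7.8]   # grows with radius -> close to +1
inverse = [8.0, 6.0, 4.0, 2.0]     # shrinks as radius grows -> exactly -1

print(round(pearson(radius, perimeter), 3))
print(round(pearson(radius, inverse), 3))
```

Values near +1 correspond to the "highly correlated" pairs, values near -1 to the inversely correlated pairs, and values near 0 to the low correlated pairs.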

### Highly correlated pairs

In [None]:

palette = {'B': 'lightblue', 'M': 'magenta'}


def plot_scatter(a, b, k):
    # Draw column a against column b in subplot position k (e.g. 221)
    plt.subplot(k)
    sns.scatterplot(x=a, y=b, hue='diagnosis', data=df, palette=palette)
    plt.title(a + ' vs ' + b, fontsize=15)


fig = plt.figure(figsize=(12, 12))
plot_scatter('texture_mean', 'texture_worst', 221)
plot_scatter('area_mean', 'radius_worst', 222)
plot_scatter('perimeter_mean', 'radius_worst', 223)
plot_scatter('radius_mean', 'perimeter_mean', 224)


### Inversely correlated pairs

In [None]:
fig = plt.figure(figsize=(12,12))

  
plot_scatter('smoothness_mean','texture_mean',221) 
plot_scatter('texture_mean','symmetry_se',222) 
plot_scatter('fractal_dimension_worst','texture_mean',223) 
plot_scatter('texture_mean','symmetry_mean',224)
  


### Low correlated pairs

In [None]:
fig = plt.figure(figsize=(12,12))
plot_scatter('area_mean','fractal_dimension_mean',221)
plot_scatter('radius_mean','fractal_dimension_mean',222)
plot_scatter('area_mean','smoothness_se',223)
plot_scatter('smoothness_se','perimeter_mean',224)

# Data Visualization

## PairPlot

In [None]:
# Set the default figure size through matplotlib.pyplot (already imported as plt)
plt.rcParams['figure.figsize'] = 8, 5

cols = ['radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean','diagnosis']

sns_plot = sns.pairplot(data=df[cols],hue='diagnosis', palette='bwr')

## ScatterPlot

In [None]:
# area_mean vs smoothness_mean

sns.scatterplot(x= 'area_mean', y= 'smoothness_mean', hue= 'diagnosis', data=df, palette='CMRmap')

In [None]:
# texture mean vs radius_mean

size = len(df['texture_mean'])

# Marker sizes and colors are random: they are decorative and carry no information
area = np.pi * (15 * np.random.rand(size))**2
colors = np.random.rand(size)

plt.xlabel("texture mean")
plt.ylabel("radius mean") 
plt.scatter(df['texture_mean'], df['radius_mean'], s=area, c= colors, alpha=0.5)

## Count Plot

In [None]:
# Target variable

sns.countplot(x='diagnosis', data=df, palette='Paired')

## Histogram

In [None]:
m = plt.hist(df[df["diagnosis"] == "M"].radius_mean, bins=30, fc=(1, 0, 0, 0.5), label="Malignant")
b = plt.hist(df[df["diagnosis"] == "B"].radius_mean, bins=30, fc=(1, 0, 0.5), label="Benign")

plt.legend()
plt.xlabel("Radius Mean Values")
plt.ylabel("Frequency")
plt.title("Histogram of Radius Mean for Benign and Malignant Tumors")
plt.show()

## JointPlot

In [None]:
# seaborn renamed the `size` argument to `height`
sns.jointplot(data=df, x='area_mean', y='smoothness_mean', height=5)

# Encoding categorical data

Machine learning algorithms work on numerical values, so categorical features must be encoded as numbers.

In [None]:
# Label Encoder

LEncoder = LabelEncoder()

df['diagnosis'] = LEncoder.fit_transform(df['diagnosis'])

So we have encoded malignant as 1 and benign as 0.
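
The mapping follows from the way `LabelEncoder` assigns codes: it sorts the distinct labels and numbers them in that order, so `'B'` comes before `'M'`. A minimal plain-Python sketch of the same idea, on a made-up list of labels:

```python
# Hypothetical sample of diagnosis labels.
labels = ["M", "B", "B", "M", "B"]

# Sort the distinct classes, then number them in that order.
classes = sorted(set(labels))
mapping = {label: code for code, label in enumerate(classes)}

encoded = [mapping[label] for label in labels]

print(classes)
print(mapping)
print(encoded)
```

With a fitted encoder, the `classes_` attribute exposes the same sorted list, which is a quick way to confirm which class received which code.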

# Pre-Modeling Tasks

## Separating the independent and the dependent variables

In [None]:
X = df.drop('diagnosis',axis=1).values
y = df['diagnosis'].values

## Splitting the dataset

In machine learning we split the dataset into training and testing data:

 - The training set, also called the learning set, is used to fit the model. It holds the larger share of the data.

 - The testing set is used to evaluate the performance of the model after hyperparameter tuning. It is also useful for comparing how different models (SVMs, neural networks, random forests...) perform against each other.

- Creating the test set is simple: we set aside a random selection of rows, typically 10% or 20% of the dataset.

- Scikit-Learn provides functions for splitting a dataset into multiple subsets.


- `train_test_split()` is the simplest of them. It does the same job as a hand-written split function, and it accepts lists, NumPy arrays, SciPy sparse matrices or pandas DataFrames.

  We also set some parameters, such as `random_state`, which fixes the seed of the random generator so that the split is reproducible.

- A common split is 80:20 for training and testing. You may need to adjust it depending on the size of the dataset and the complexity of the model.

In [None]:
random_state = 42

x_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=random_state)
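
To make the idea concrete, here is a hand-written sketch of a seeded random split using only the standard library. The helper name `split_indices` is hypothetical, and its shuffle will not reproduce the exact rows chosen by `train_test_split`; it only illustrates the mechanics and the resulting sizes for the 569 rows of this dataset.

```python
import math
import random

def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Round the test size up so that 20% of 569 rows gives 114 test rows.
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(569)

print(len(train_idx), len(test_idx))
# Same seed, same split: this is what makes the experiment reproducible.
print(split_indices(569) == (train_idx, test_idx))
```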

# Feature Scaling

Feature scaling is a data preprocessing step applied to the independent variables (features) that brings them onto a comparable range. It matters most for models that rely on distances or gradient-based optimization, such as KNN, SVM and logistic regression; tree-based models are largely insensitive to it.


There are two common ways of scaling a dataset:

 - Standardization

 - Min-max scaling

- Standardization: subtract the mean (so standardized values always have zero mean), then divide by the standard deviation. The result is not bounded to a fixed range such as 0 to 1, which can be a problem for algorithms that expect inputs in that range, as neural networks often do.

  Scikit-Learn provides a transformer called **StandardScaler**. It transforms each feature so that its distribution has a mean of 0 and a standard deviation of 1.


- Min-max scaling: also called normalization, this is the simplest way to scale data. Values are shifted and rescaled so that they end up ranging from 0 to 1, by subtracting the min value and dividing by the max minus the min.

  Scikit-Learn provides a transformer called **MinMaxScaler**. It has a hyperparameter called `feature_range` that lets you specify a different range.
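
As an illustration of the min-max formula described above, the following sketch rescales a single made-up feature by hand. The function `min_max_scale` is a hypothetical helper, not part of Scikit-Learn, and the way it handles a constant column is a choice made for this example.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so that min -> low end and max -> high end."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Constant column: map everything to the low end of the range.
        return [low for _ in values]
    return [low + (v - v_min) / span * (high - low) for v in values]

# Hypothetical area values.
areas = [200.0, 500.0, 1100.0, 2000.0]

print(min_max_scale(areas))
print(min_max_scale(areas, feature_range=(-1.0, 1.0)))
```

Because the minimum and maximum define the scale, a single extreme value squeezes all the other values into a narrow band, which is one reason standardization is often preferred when outliers are present.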

In [None]:
sc = StandardScaler()

X_train = sc.fit_transform(x_train)
X_test= sc.transform(x_test)
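
Note that the scaler is fitted on the training data only and then applied to the test data. The sketch below shows the same arithmetic by hand for one made-up feature: the mean and standard deviation come from the training values, and the test values reuse them. It uses the population standard deviation, which is what `StandardScaler` computes.

```python
import statistics

# Hypothetical values of one feature.
train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [11.0, 20.0]

# "fit": learn the statistics from the training data only.
mean = statistics.fmean(train)
std = statistics.pstdev(train)  # population standard deviation (ddof = 0)

# "transform": apply the same statistics to both sets.
train_scaled = [(v - mean) / std for v in train]
test_scaled = [(v - mean) / std for v in test]

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction; the test values generally do not, and that is expected. Fitting the scaler on the test set as well would leak information about it into the model.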

# Modeling

- In this part we'll try different machine learning models: Logistic Regression, Gradient Boosting Classifier, Random Forest, Decision Tree, KNeighbors, XGB Classifier and Support Vector Machine.

In [None]:
# Logistic Regression


logreg= LogisticRegression()

logreg.fit(X_train, y_train)

y_pred_logreg = logreg.predict(X_test)


# Gradient Boosting Classifier


GB = GradientBoostingClassifier()

GB.fit(X_train, y_train)

y_pred_GB = GB.predict(X_test)



# Random Forest Classifier

rf = RandomForestClassifier()

rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)


# Decision Tree Classifier

dt = DecisionTreeClassifier()

dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)


# KNeighbors Classifier


knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)


# XGB Classifier

XGB = XGBClassifier() 

XGB.fit(X_train, y_train)

y_pred_XGB = XGB.predict(X_test)



# Support Vector classifier

svc = SVC(probability=True)

svc.fit(X_train,y_train)

y_pred_svc = svc.predict(X_test)


In [None]:
X_train.shape, y_train.shape,X_test.shape, y_test.shape

# Evaluation and comparison of all the models

In [None]:
# Keep each name next to its estimator so the labels cannot drift out of order
classifiers = [
    ("SVC", SVC()),
    ("DecisionTreeClassifier", DecisionTreeClassifier()),
    ("LogisticRegression", LogisticRegression()),
    ("KNeighborsClassifier", KNeighborsClassifier()),
    ("XGB", XGBClassifier()),
    ("RandomForestClassifier", RandomForestClassifier()),
    ("GradientBoostingClassifier", GradientBoostingClassifier()),
]

names = []
accuracies = []

for name, model in classifiers:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    names.append(name)
    accuracies.append(accuracy_score(y_test, pred))

In [None]:
d = {"Accuracy": accuracies, "Algorithm": names}
data_frame = pd.DataFrame(d)
data_frame

In [None]:
sns.barplot(x='Accuracy', y='Algorithm', data=data_frame, palette='husl').set_title('Accuracy of all Algorithms')

As the table and graph above show, the SVC classifier works best for this dataset on this train/test split.

# Evaluating The Performance of the model

Evaluating the machine learning model is a crucial part of any data science project. There are many metrics that help us evaluate a model's performance:

- Classification Accuracy

- Confusion matrix

- Precision

- Recall

- classification_report

- ROC AUC Score

- Area under curve (AUC)

Now, let's see the performance metrics of svc classifier

## Confusion Matrix

- A confusion matrix is a table used to measure the performance of a machine learning algorithm, usually a supervised one. Each row of the confusion matrix represents the instances of an actual class and each column represents the instances of a predicted class.


In a binary classifier, the "**true**" class is typically labeled with 1 and the "**false**" class is labeled with 0.

  - True Positive: A positive class observation (1) is correctly classified as positive by the model.

  - False Positive: A negative class observation (0) is incorrectly classified as positive.

  - True Negative: A negative class observation is correctly classified as negative.

  - False Negative: A positive class observation is incorrectly classified as negative.
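
The four definitions above can be sketched as a small counting function. The helper `confusion_counts` and the toy labels are hypothetical; the `positive` argument makes explicit which class is treated as the positive one.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for the chosen positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn

# Hypothetical labels and predictions.
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 1]

print(confusion_counts(y_true_toy, y_pred_toy))              # positive class = 1
print(confusion_counts(y_true_toy, y_pred_toy, positive=0))  # positive class = 0
```

Switching the positive class swaps the roles of the four cells, so it is worth stating which class counts as positive before quoting any of them.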

Let's visualize the confusion matrix to see how accurate our results are.

In [None]:
cm = np.array(confusion_matrix(y_test, y_pred_svc, labels=[1,0]))

confusion_mat= pd.DataFrame(cm, index = ['cancer', 'healthy'],
                           columns =['predicted_cancer','predicted_healthy'])

confusion_mat

In [None]:
sns.heatmap(cm,annot=True,fmt='g',cmap='Set3')

- Reading the table above with Healthy as the "yes" class:

   - **True Positive (TP)**: cases the model predicted as yes (Healthy) that are actually yes (Healthy).
   - **True Negative (TN)**: cases the model predicted as no (Cancer) that are actually no (Cancer).
   - **False Positive (FP)**: cases the model predicted as yes (Healthy) that are actually no (Cancer).
   - **False Negative (FN)**: cases the model predicted as no (Cancer) that are actually yes (Healthy).


In this reading, a "yes" prediction indicates the absence of cancer cells (Healthy) and a "no" prediction indicates their presence (Cancer). Note that this is the reverse of the label encoding, where malignant = 1; scikit-learn's `precision_score` and `recall_score` treat label 1 as the positive class by default, so the scores they print describe the malignant class.



## Accuracy_Score

- **Accuracy_Score** is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations.


(TP + TN) / total = 112 / 114 = 0.98245614

In [None]:
print(accuracy_score(y_test, y_pred_svc))

## Precision 

- **Precision** is the ratio of correctly predicted positive observations to the total predicted positive observations.

In [None]:
print(precision_score(y_test, y_pred_svc))

## Recall

- **Recall**, also called sensitivity, is the ratio of positive instances correctly detected by the classifier to all observations that are actually positive.

In [None]:
print(recall_score(y_test, y_pred_svc))

## Classification Report


In [None]:
print(classification_report(y_test, y_pred_svc))


- True Positive(TP) : 71
    
- True Negative(TN) : 41
    
- False Positive(FP): 2
    
- False Negative(FN): 0


**True Positive Rate/Recall/Sensitivity: How often does the model predict yes (Healthy) when it's actually yes (Healthy)?**

- **True Positive Rate (TPR)** = TP / (TP + FN) = 71 / (71 + 0) = 1.00

  The ratio TP / (TP + FP) = 71 / (71 + 2) = 0.97 is the precision for the Healthy class, not its recall.


**False Positive Rate: How often does the model predict yes (Healthy) when it's actually no (Cancer)?**

- **False Positive Rate (FPR)** = FP / (FP + TN) = 2 / (2 + 41) = 0.05
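
As a check on the arithmetic, the sketch below recomputes these rates from the four counts listed above, first with Healthy as the positive class and then with Cancer as the positive class (the convention `precision_score` and `recall_score` use by default when malignant is encoded as 1). The helper `rates` is hypothetical and assumes none of the denominators is zero.

```python
def rates(tp, fp, tn, fn):
    """Standard rates derived from the four confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),   # true positive rate
        "fpr": fp / (fp + tn),      # false positive rate
    }

# Healthy as the positive class: the counts exactly as listed above.
healthy_positive = rates(tp=71, fp=2, tn=41, fn=0)

# Cancer as the positive class: same predictions, roles of the cells swapped.
cancer_positive = rates(tp=41, fp=0, tn=71, fn=2)

for name, result in [("healthy", healthy_positive), ("cancer", cancer_positive)]:
    print(name, {metric: round(value, 4) for metric, value in result.items()})
```

Accuracy is the same in both readings, while precision and recall trade places in what they penalize: the two misclassified tumours lower the precision of the Healthy class and the recall of the Cancer class.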

## The ROC Curve

In [None]:

#plt.style.use('seaborn-pastel')

y_score = svc.decision_function(X_test)

FPR, TPR, _ = roc_curve(y_test, y_score)
ROC_AUC = auc(FPR, TPR)
print (ROC_AUC)

plt.figure(figsize =[11,9])
plt.plot(FPR, TPR, label= 'ROC curve(area = %0.2f)'%ROC_AUC, linewidth= 4)
plt.plot([0,1],[0,1], 'k--', linewidth = 4)
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.xlabel('False Positive Rate', fontsize = 18)
plt.ylabel('True Positive Rate', fontsize = 18)
plt.title('Receiver operating characteristic example', fontsize= 18)
plt.show()

- The ROC curve shows the trade-off between sensitivity (TPR) and specificity (1 - FPR). The curve of the **svc** classifier lies close to the top-left corner, which indicates good performance.
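
For intuition about how such a curve is built, here is a minimal sketch that sweeps a decision threshold over a handful of made-up scores and records one (FPR, TPR) point per threshold. The helpers `roc_points` and `trapezoid_area` are hypothetical and only meant to mirror the idea behind `roc_curve` and `auc`, not their implementation.

```python
def roc_points(y_true, scores):
    """One (FPR, TPR) point per distinct threshold, from strictest to loosest."""
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and classifier scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(toy_labels, toy_scores)
print(curve)
print(trapezoid_area(curve))
```

Lowering the threshold can only add predicted positives, so the curve moves up and to the right from (0, 0) to (1, 1).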

## Area Under Curve

Area Under Curve (AUC) is a common way to compare classifiers. A perfect classifier has a ROC AUC equal to 1, while a purely random one has a ROC AUC of 0.5.

Scikit-Learn provides a function to compute the ROC AUC.

In [None]:
roc_auc_score(y_test, y_score)
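
One way to interpret the number: the ROC AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on made-up data; `auc_by_pairs` is a hypothetical helper written for illustration, not the method `roc_auc_score` uses internally.

```python
def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked in the right order."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and classifier scores.
pair_labels = [0, 0, 1, 1]
pair_scores = [0.1, 0.4, 0.35, 0.8]

print(auc_by_pairs(pair_labels, pair_scores))
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for small examples like this one.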

# Resources

- [Understanding a Classification Report For Your Machine Learning Model](https://medium.com/@kohlishivam5522/understanding-a-classification-report-for-your-machine-learning-model-88815e2ce397)

- [True Positive Rate](https://www.sciencedirect.com/topics/computer-science/true-positive-rate)

- [How to plot an ROC curve in Python](https://www.kite.com/python/answers/how-to-plot-an-roc-curve-in-python)