In [3]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

In [6]:
# Load the dataset

data=load_breast_cancer()


In [8]:
#Organising our data

label_names=data['target_names']
label=data['target']

feature_names=data['feature_names']
features=data['data']


In [41]:
# Taking a look at our data
# Prints the class names, the first instance's label,
# the first feature name, and the first instance's feature values
print(label_names)
print("-------------------------------------------------------")
print("Class label=",label[0])
print("-------------------------------------------------------")
print(feature_names[0])
print("-------------------------------------------------------")
print(features[0])

['malignant' 'benign']
-------------------------------------------------------
Class label= 0
-------------------------------------------------------
mean radius
-------------------------------------------------------
[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
 4.601e-01 1.189e-01]


* Our class names are malignant and benign, which are mapped to the binary values 0 and 1

* 0 represents malignant tumors
* 1 represents benign tumors
* Our first instance is a malignant tumor whose mean radius is 1.799e+01
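
The label-to-name mapping described above can also be checked in a table. The following is a minimal sketch, assuming pandas is available (it is imported at the top of this notebook); the names `df` and `diagnosis` are illustrative and not part of the original code.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# One row per tumor, one column per feature
df = pd.DataFrame(data["data"], columns=data["feature_names"])

# Translate the integer labels (0/1) into class names by index lookup:
# target_names[0] is 'malignant', target_names[1] is 'benign'
df["diagnosis"] = data["target_names"][data["target"]]

# First instance: its mean radius and its class name
print(df.loc[0, ["mean radius", "diagnosis"]])

# Number of instances in each class
print(df["diagnosis"].value_counts())
```

Indexing `target_names` with the whole `target` array converts every label in one step, so no loop is needed.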


<font color='blue'> Organizing data into sets</font>

To evaluate how well a classifier performs, we should always test
the model on unseen data.
Therefore, before building a model, let's split our
data into two parts: *a training set* and a *test set.*

* We use the training set to train and evaluate the model during the development stage, and then use the trained model to make predictions on the unseen test set.

<font color="red"> Performance on the unseen test set gives a sense of how well the model generalizes to new data, that is, of its performance and robustness. </font>


* sklearn has a function called *train_test_split()*,
which divides data into these sets. 
* Import the function and then use it to split the data:



In [22]:
from sklearn.model_selection import train_test_split

In [25]:
#If random_state is an integer,
#then it is used to seed a new RandomState object.
train,test,train_labels,test_labels=train_test_split(
    features,label,test_size=0.33,random_state=42)

* The function shuffles the data and splits it at random; the test_size
parameter sets the fraction held out for testing

* Here the test set (test) makes up
 33% of the original dataset.
 The remaining data (train) is the training data.


* The corresponding labels for the train/test variables are *train_labels* and *test_labels*
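
To see what the split produced, the shapes and class balance of the two sets can be printed. This is a sketch that repeats the split from the cell above; the second call with `stratify=label` is an optional variation that is not used elsewhere in this notebook, and the `s_`-prefixed names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
features, label = data["data"], data["target"]

# Same split as in the notebook cell
train, test, train_labels, test_labels = train_test_split(
    features, label, test_size=0.33, random_state=42)

# Rows are instances, columns are the 30 features
print(train.shape, test.shape)

# Fraction of benign (label 1) instances in each set
print(train_labels.mean(), test_labels.mean())

# Optional variation: stratify keeps the class proportions
# nearly identical in the training and test sets
s_train, s_test, s_train_labels, s_test_labels = train_test_split(
    features, label, test_size=0.33, random_state=42, stratify=label)
print(s_train_labels.mean(), s_test_labels.mean())
```

With 569 instances and test_size=0.33, the test set holds 188 rows and the training set 381.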



<font color="red"> Building and evaluating a model using Naive Bayes</font>

* Import the GaussianNB class from the sklearn.naive_bayes module.
* Initialize the model with GaussianNB()
* Train the model by fitting it to the training data using gnb.fit()

In [28]:
from sklearn.naive_bayes import GaussianNB

# initialize classifier
gnb=GaussianNB()

In [29]:
# train classifier
model = gnb.fit(train,train_labels)
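
Fitting a Gaussian Naive Bayes model amounts to estimating, for each class, a prior probability and a per-feature mean and variance. The sketch below repeats the steps above in one self-contained block and inspects a few fitted attributes (`classes_`, `class_prior_`, `theta_`) as an illustration; it is not part of the original workflow.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42)

gnb = GaussianNB()
gnb.fit(train, train_labels)  # fit() returns the fitted estimator itself

# Class labels seen during training: 0 (malignant) and 1 (benign)
print(gnb.classes_)

# Prior probability of each class, estimated from the training labels
print(gnb.class_prior_)

# Per-class mean of every feature: one row per class, one column per feature
print(gnb.theta_.shape)
```

Because fit() returns the estimator, `model` and `gnb` in the cell above refer to the same fitted object.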

* After training the model, we use it to make predictions on our test dataset with the *predict()* function

* predict() returns an array containing one prediction for each data instance in the test set

* We call predict() on the test set and print the result

In [31]:
predc= gnb.predict(test)
print(predc)

[1 0 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0
 1 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 0 1 0
 1 1 1 1 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0
 1 1 0 0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 0 1 1 0 1 0 0
 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0
 0 1 1]


* The function returned the predicted values as an array of 0's and 1's,
one per test instance, for the tumor class (malignant vs. benign)
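
The 0/1 predictions can be translated back into class names, and the classifier can also report how confident it is. The following is a sketch, assuming the same split and model as above; `predict_proba()` is a standard GaussianNB method, while the variable names `predicted_names` and `proba` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
label_names = data["target_names"]
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42)

gnb = GaussianNB().fit(train, train_labels)
predc = gnb.predict(test)

# Map each predicted 0/1 back to 'malignant'/'benign'
predicted_names = label_names[predc]
print(predicted_names[:5])

# Estimated class probabilities: column 0 is malignant, column 1 is benign
proba = gnb.predict_proba(test)
print(proba[:5].round(3))
```

Each row of `proba` sums to 1; note that Naive Bayes probabilities are often pushed close to 0 or 1, so they are better read as a ranking than as calibrated confidence.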

<font color='blue' size=4 > Evaluating the Model’s Accuracy </font>




* We can evaluate the accuracy of our model’s predicted values
by comparing the two arrays (test_labels vs. predc)

* Use the sklearn function accuracy_score() to
determine the accuracy

In [40]:
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels,predc))

0.9414893617021277


* It means that the classifier correctly predicted whether the tumor was malignant or benign for 94.15% of the instances in the test set
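
Accuracy is simply the fraction of test instances where the prediction equals the true label, so it can be recomputed by hand. The sketch below does that and also prints a confusion matrix with `confusion_matrix()` from sklearn.metrics, which is an addition not used in the original notebook, to show which kind of mistake the classifier makes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
train, test, train_labels, test_labels = train_test_split(
    data["data"], data["target"], test_size=0.33, random_state=42)

gnb = GaussianNB().fit(train, train_labels)
predc = gnb.predict(test)

# Accuracy computed by hand: share of positions where the arrays agree
manual_accuracy = (test_labels == predc).mean()
acc = accuracy_score(test_labels, predc)
print(manual_accuracy, acc)

# Rows are true classes, columns are predicted classes
# (index 0 = malignant, index 1 = benign)
cm = confusion_matrix(test_labels, predc)
print(cm)
```

The diagonal of the matrix counts correct predictions; the entry in row 0, column 1 counts malignant tumors predicted as benign, which is usually the more costly error in this setting.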