# Naive Bayes Algorithm Implementation on Cancer Dataset

## Content:

1. [Importing Dataset](#1)
1. [Getting Info About Dataset](#2)
1. [Dataset Visualization](#3)
1. [Meaning Of Naive Bayes Algorithm](#4)
1. [Naive Bayes with Sklearn](#5)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = "1"></a>

# 1. Importing Dataset

In [None]:
dataset = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")

<a id = "2"></a>
# 2. Getting Info About Dataset

Dataset information:

* Dataset Characteristics: Multivariate
* Attribute Characteristics: Real
* Associated Tasks: Classification
* Number of Instances: 569
* Number of Attributes: 32
* Missing Values: No

Column names and meanings:
* id: ID number
* diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)
* radius_mean: mean of distances from center to points on the perimeter
* texture_mean: standard deviation of gray-scale values
* perimeter_mean: mean size of the core tumor
* area_mean: area of the tumor
* smoothness_mean: mean of local variation in radius lengths
* compactness_mean: mean of perimeter^2 / area - 1.0
* concavity_mean: mean of severity of concave portions of the contour
* concave_points_mean: mean for number of concave portions of the contour
* symmetry_mean
* fractal_dimension_mean: mean for "coastline approximation" - 1
* radius_se: standard error for the mean of distances from center to points on the perimeter
* texture_se: standard error for standard deviation of gray-scale values
* perimeter_se
* area_se
* smoothness_se: standard error for local variation in radius lengths
* compactness_se: standard error for perimeter^2 / area - 1.0
* concavity_se: standard error for severity of concave portions of the contour
* concave_points_se: standard error for number of concave portions of the contour
* symmetry_se
* fractal_dimension_se: standard error for "coastline approximation" - 1
* radius_worst: "worst" or largest mean value for mean of distances from center to points on the perimeter
* texture_worst: "worst" or largest mean value for standard deviation of gray-scale values
* perimeter_worst
* area_worst
* smoothness_worst: "worst" or largest mean value for local variation in radius lengths
* compactness_worst: "worst" or largest mean value for perimeter^2 / area - 1.0
* concavity_worst: "worst" or largest mean value for severity of concave portions of the contour
* concave_points_worst: "worst" or largest mean value for number of concave portions of the contour
* symmetry_worst
* fractal_dimension_worst: "worst" or largest mean value for "coastline approximation" - 1
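
Each measurement in the list above appears three times, with the suffixes `_mean`, `_se`, and `_worst`. The sketch below illustrates how such a triple could be derived from the per-nucleus values of one image. The radii are made-up numbers, `summarize` is a hypothetical helper, and "worst" is assumed to be the mean of the three largest values, following the dataset's original documentation.

```python
import math
import statistics

# Hypothetical radii (arbitrary units) for the cell nuclei found in one image.
nucleus_radii = [10.0, 12.0, 11.0, 15.0, 14.0, 13.0]

def summarize(values):
    """Return (mean, standard error, worst) for one per-nucleus measurement."""
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    se = statistics.stdev(values) / math.sqrt(len(values))
    # Assumption: "worst" is the mean of the three largest values.
    worst = statistics.mean(sorted(values)[-3:])
    return mean, se, worst

radius_mean, radius_se, radius_worst = summarize(nucleus_radii)
print(radius_mean, radius_se, radius_worst)
```

The dataset itself ships only these summary columns, not the per-nucleus values, so this is for intuition only.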

In [None]:
dataset.head()

In [None]:
dataset.info()

Next, drop the "id" and "Unnamed: 32" columns. "id" is only a record identifier and "Unnamed: 32" is an empty column, so neither helps predict whether a tumor is malignant or benign.

In [None]:
dataset = dataset.drop(["id"], axis = 1)

In [None]:
dataset = dataset.drop(["Unnamed: 32"], axis = 1)

In [None]:
dataset.head(3)

In [None]:
M = dataset[dataset.diagnosis == "M"]

In [None]:
M.head(5)

In [None]:
B = dataset[dataset.diagnosis == "B"]

In [None]:
B.head(5)

<a id = "3"></a>
# 3. Dataset Visualization 

In [None]:
plt.title("Malignant vs Benign Tumor")
plt.xlabel("Radius Mean")
plt.ylabel("Texture Mean")
plt.scatter(M.radius_mean, M.texture_mean, color = "red", label = "Malignant", alpha = 0.3)
plt.scatter(B.radius_mean, B.texture_mean, color = "lime", label = "Benign", alpha = 0.3)
plt.legend()
plt.show()

<a id = "4"></a>
# 4. Meaning Of Naive Bayes Algorithm

![](https://www.intelkit.com/wp-content/uploads/2020/04/Naive-bayes-1.png)
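
In short, Naive Bayes applies Bayes' theorem under the "naive" assumption that features are independent given the class: P(class | x) is proportional to P(class) multiplied by the product of P(x_i | class) over all features. The Gaussian variant models each P(x_i | class) as a normal distribution. The following is a minimal from-scratch sketch of that idea, not scikit-learn's actual implementation. The function names and the toy rows are hypothetical and chosen only for illustration.

```python
import math

def fit(rows, labels):
    """Estimate the prior and the per-feature mean/variance for each class."""
    model = {}
    for cls in set(labels):
        members = [r for r, l in zip(rows, labels) if l == cls]
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        model[cls] = (n / len(rows), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Return the class with the largest log posterior (up to a constant)."""
    def log_posterior(cls):
        prior, means, variances = model[cls]
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances)
        )
    return max(model, key=log_posterior)

# Toy data: two already-normalized features, label 1 = malignant, 0 = benign.
rows = [[0.20, 0.30], [0.25, 0.35], [0.30, 0.25],
        [0.70, 0.80], [0.75, 0.70], [0.80, 0.75]]
labels = [0, 0, 0, 1, 1, 1]

model = fit(rows, labels)
print(predict(model, [0.28, 0.30]))
print(predict(model, [0.72, 0.78]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.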

<a id = "5"></a>
# 5. Naive Bayes with Sklearn

In [None]:
dataset.diagnosis = [1 if i == "M" else 0 for i in dataset.diagnosis]

In [None]:
x = dataset.drop(["diagnosis"], axis = 1)
y = dataset.diagnosis.values

In [None]:
# Min-max normalization: scale every feature column to the [0, 1] range
x = (x - x.min()) / (x.max() - x.min())

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
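
The normalization step in this notebook computes the minimum and maximum over all rows before the split, so the test rows influence the scaling. A minimal sketch of an alternative is shown below: it takes the statistics from the training rows only and reuses them for the test rows. `fit_min_max` and `apply_min_max` are hypothetical helpers and the numbers are made up. scikit-learn's `MinMaxScaler` offers the same fit-then-transform pattern.

```python
def fit_min_max(rows):
    """Column-wise minimum and maximum of a list of equal-length rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]

def apply_min_max(rows, mins, maxs):
    """Scale each column with (v - min) / (max - min); constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

# Made-up training and test rows with two features each.
train = [[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]]
test = [[12.0, 500.0]]

# Learn the scaling from the training rows only, then reuse it for the test rows.
mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)
print(train_scaled)
print(test_scaled)
```

With this approach, scaled test values can fall outside [0, 1] when a test row exceeds the range seen during training.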

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train, y_train)

In [None]:
print("Naive Bayes score: ",nb.score(x_test, y_test))

In [None]:
# Additional classifiers and the metric used for the comparison below
from sklearn import linear_model, tree
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=15)
clf = knn.fit(x_train, y_train)
y_pred = clf.predict(x_test)
acc_knb_model=roc_auc_score(y_test, y_pred)*100
acc_knb_model

In [None]:
lr = LogisticRegression(C = 0.2)
clf1 = lr.fit(x_train, y_train)
y_pred1 = clf1.predict(x_test)
acc_log_reg=roc_auc_score(y_test, y_pred1)*100
acc_log_reg

In [None]:
clf3 = tree.DecisionTreeClassifier().fit(x_train, y_train)
y_pred3 = clf3.predict(x_test)
acc_dt=roc_auc_score(y_test, y_pred3)*100
acc_dt

In [None]:
clf4 = RandomForestClassifier(max_depth=5, random_state=0).fit(x_train, y_train)
y_pred4 = clf4.predict(x_test)
acc_rmf_model=roc_auc_score(y_test, y_pred4)*100
acc_rmf_model

In [None]:
clf5 = SVC(gamma='auto').fit(x_train, y_train)
y_pred5 = clf5.predict(x_test)
acc_svm_model=roc_auc_score(y_test, y_pred5)*100
acc_svm_model

In [None]:
sgd_model = SGDClassifier()
sgd_model.fit(x_train, y_train)
sgd_pred = sgd_model.predict(x_test)
# Score on the held-out test set with ROC AUC, matching the models above
acc_sgd = roc_auc_score(y_test, sgd_pred) * 100
acc_sgd

In [None]:
xgb_model = XGBClassifier()
xgb_model.fit(x_train, y_train)
xgb_pred = xgb_model.predict(x_test)
# Score on the held-out test set with ROC AUC, matching the models above
acc_xgb = roc_auc_score(y_test, xgb_pred) * 100
acc_xgb

In [None]:
lgbm = LGBMClassifier()
lgbm.fit(x_train, y_train)
lgbm_pred = lgbm.predict(x_test)
# Score on the held-out test set with ROC AUC, matching the models above
acc_lgbm = roc_auc_score(y_test, lgbm_pred) * 100
acc_lgbm

In [None]:
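regr = linear_model.LinearRegression()
regr.fit(x_train, y_train)
# LinearRegression returns continuous values, so threshold at 0.5 to get class labels
regr_pred = (regr.predict(x_test) >= 0.5).astype(int)
# Score on the held-out test set with ROC AUC, matching the models above
acc_regr = roc_auc_score(y_test, regr_pred) * 100
acc_regr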

In [None]:
# Collect the test-set scores (ROC AUC computed from predicted labels, in percent)
results = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Stochastic Gradient Descent', 'Linear Regression',
              'Naive Bayes', 'XGBoost', 'LightGBM', 'Decision Tree'],
    'Score': [acc_svm_model, acc_knb_model, acc_log_reg,
              acc_rmf_model, acc_sgd, acc_regr,
              roc_auc_score(y_test, nb.predict(x_test)) * 100,
              acc_xgb, acc_lgbm, acc_dt]})
# Rank the models from best to worst and index the table by model name
result_df = results.sort_values(by='Score', ascending=False)
result_df = result_df.set_index('Model')
result_df