**PREDICTING BREAST CANCER USING VARIOUS CLASSIFICATION MODELS.**
* The features are computed from digitized images of cell nuclei and can be used to build a model that predicts whether a tumor is benign or malignant.
* 1 = Malignant (Cancerous) - Present (M)
* 0 = Benign (Not Cancerous) - Absent (B)

In [None]:
#importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#importing dataset
ds = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')

In [None]:
#reviewing dataset
pd.set_option('display.max_columns',None)
ds.head()

In [None]:
#dropping unnecessary features
ds.drop(['id', 'Unnamed: 32'], axis = 1, inplace = True)

In [None]:
#checking type of features
ds.info()

In [None]:
#dataset has 569 rows and 31 columns
ds.shape

In [None]:
#checking for null values
ds.isnull().sum()

**NO MISSING DATA**

In [None]:
#taking care of categorical values
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
ds['diagnosis']=le.fit_transform(ds['diagnosis'])
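
`LabelEncoder` assigns integer codes in sorted order of the class labels, so `B` becomes 0 and `M` becomes 1, which matches the legend at the top. A minimal sketch on a hypothetical toy column (not the real dataset) that makes the mapping explicit:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical toy labels standing in for ds['diagnosis']
toy = pd.Series(['M', 'B', 'B', 'M', 'B'], name='diagnosis')

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(toy)

# classes_ is sorted, so index 0 is 'B' and index 1 is 'M'
mapping = {str(label): int(code) for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(encoded.tolist())
```

Checking `classes_` after fitting is a cheap way to confirm which class ended up as the positive label before reading any confusion matrix.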

In [None]:
ds.head()

In [None]:
plt.figure(figsize=(16,14))
sns.heatmap(ds.corr(), cmap='Blues', annot = True)
plt.title("Correlation Map", fontweight = "bold", fontsize=16)

**WE CAN EITHER REMOVE THE HIGHLY CORRELATED FEATURES OR USE ALL THE FEATURES. I AM USING ALL FEATURES.**
* **REMOVING CORRELATED FEATURES MAY INCREASE ACCURACY**
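
If the first option is chosen, one common approach is to drop one feature from every pair whose absolute correlation exceeds a threshold. The sketch below uses a small synthetic frame with hypothetical column names and an assumed threshold of 0.9; `drop_correlated` is an illustrative helper, not part of this notebook.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one column from each pair with |correlation| above threshold."""
    corr = df.corr().abs()
    # keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# synthetic data: 'perimeter' is almost a linear function of 'radius'
rng = np.random.default_rng(0)
radius = rng.normal(14, 3, 200)
demo = pd.DataFrame({
    'radius': radius,
    'perimeter': 2 * np.pi * radius + rng.normal(0, 0.1, 200),
    'texture': rng.normal(19, 4, 200),
})

reduced, dropped = drop_correlated(demo, threshold=0.9)
print(dropped)
print(reduced.columns.tolist())
```

On the real data the same helper could be applied to `x` before splitting. Whether it helps accuracy would have to be checked per model.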

In [None]:
#defining dependent and independent variables
x = ds.drop('diagnosis', axis=1)
y = ds['diagnosis']

In [None]:
#splitting data into training and testing set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
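
If the two classes are not equally frequent, passing `stratify=y` keeps the class ratio the same in both splits. This is an optional variation, not what the cell above does, and the labels below are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic imbalanced labels: 30 positives out of 100
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# compare the positive rate in the two splits
print(y_tr.mean(), y_te.mean())
```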

**APPLYING FEATURE SCALING MAY IMPROVE ACCURACY**

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
#reuse the mean and std learned from the training set; do not refit on test data
x_test = sc.transform(x_test)
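
A scaler is normally fitted on the training set only, and the learned mean and standard deviation are then reused on the test set. Refitting on the test set puts the two splits on different scales. A small NumPy sketch of the idea, with made-up numbers:

```python
import numpy as np

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0], [12.0]])

# statistics come from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma  # same transform as the training data

# refitting on the test set hides that the test values are far larger
test_refit = (test - test.mean(axis=0)) / test.std(axis=0)

print(test_scaled.ravel())
print(test_refit.ravel())
```

With the training statistics, the test points stay far from zero, as they should. After refitting they are centered on zero and look like ordinary training points.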

**APPLYING MODELS**

In [None]:
#training model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter = 10000)
lr.fit(x_train,y_train)

#getting confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = lr.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
from sklearn.metrics import accuracy_score
lra = accuracy_score(y_test,y_pred)
print('accuracy score = ',lra)
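
Accuracy alone hides which kind of error a model makes, and for a cancer screen a missed malignant case (false negative) is usually the more serious one. With the encoding used here (1 = malignant), scikit-learn's confusion matrix is laid out as `[[TN, FP], [FN, TP]]`, so recall and precision can be read off directly. The counts below are hypothetical, not results from this notebook.

```python
import numpy as np

# hypothetical confusion matrix: rows = true class, columns = predicted class
cm_demo = np.array([[65, 2],
                    [3, 44]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
recall = tp / (tp + fn)       # share of malignant cases that were caught
precision = tp / (tp + fp)    # share of malignant predictions that were right

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```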

In [None]:
#training model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski',p = 2)
knn.fit(x_train,y_train)

#getting confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = knn.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
from sklearn.metrics import accuracy_score
knna = accuracy_score(y_test,y_pred)
print('accuracy score = ',knna)

In [None]:
#training model
from sklearn.svm import SVC
svc = SVC(kernel = 'linear')
svc.fit(x_train,y_train)

#getting confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = svc.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
from sklearn.metrics import accuracy_score
sva = accuracy_score(y_test,y_pred)
print('accuracy score = ',sva)

In [None]:
#training model
from sklearn.svm import SVC
svc = SVC(kernel = 'rbf')
svc.fit(x_train,y_train)

#getting confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = svc.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
from sklearn.metrics import accuracy_score
sva2 = accuracy_score(y_test,y_pred)
print('accuracy score = ',sva2)

In [None]:
#training model
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train,y_train)

#getting confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = nb.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
from sklearn.metrics import accuracy_score
nba = accuracy_score(y_test,y_pred)
print('accuracy score = ',nba)

In [None]:
#training model
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion = 'entropy')
dt.fit(x_train,y_train)

#getting confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = dt.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
from sklearn.metrics import accuracy_score
dta = accuracy_score(y_test,y_pred)
print('accuracy score = ',dta)

In [None]:
#training model
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 60, criterion = 'entropy',random_state = 0)
rf.fit(x_train,y_train)

#getting confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = rf.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('confusion matrix:\n',cm)

#checking accuracy
from sklearn.metrics import accuracy_score
rfa = accuracy_score(y_test,y_pred)
print('accuracy score = ',rfa)

In [None]:
#comparing accuracies
plt.figure(figsize= (8,7))
ac = [lra,knna,sva,sva2,nba,dta,rfa]
name = ['Logistic Regression','knn','svm','Kernel Svm','Naive Bayes','Decision Tree', 'Random Forest']
sns.barplot(x = ac,y = name,palette='pastel')
plt.title("Plotting the Model Accuracies", fontsize=16, fontweight="bold")

**RANDOM FOREST AND KERNEL SVM TIED WITH AN ACCURACY OF 98.2%**
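
A single 80/20 split of 569 rows leaves only about 114 test rows, so a tie or a gap of one or two misclassified samples can easily change with a different `random_state`. One way to get a steadier comparison is k-fold cross-validation with the scaler inside a pipeline. The sketch below assumes scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`), which encodes the target the other way round (0 = malignant, 1 = benign), and compares only two of the models. Scores are not shown because they depend on the folds.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_cv, y_cv = load_breast_cancer(return_X_y=True)

models = {
    'Kernel SVM': make_pipeline(StandardScaler(), SVC(kernel='rbf')),
    'Random Forest': RandomForestClassifier(
        n_estimators=60, criterion='entropy', random_state=0
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = {}
for model_name, model in models.items():
    # the pipeline refits the scaler on each training fold only
    scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring='accuracy')
    cv_scores[model_name] = scores
    print(f"{model_name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Comparing the fold-to-fold standard deviation with the gap between the means gives a rough sense of whether one model is really ahead.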