# Introduction

The aim of this project is to determine whether a breast cancer cell is benign or malignant using a Random Forest Classifier, an SVM Classifier and a Neural Network, and to compare their accuracy on this dataset.

1. [Load and Check Data](#1)
2. [Data Analysis](#2)
3. [Modelling](#3)
    * [Train - Test Split](#4)
    * [Standardization](#5)
    * [Random Forest Classifier](#6)
    * [SVM Classifier](#7)
    * [Neural Network](#8)
4. [Accuracy Scores](#9) 
5. [Result](#10)
    
    

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = '1'></a><br>

# Load and Check Data

In [None]:
#importing required packages
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split


import warnings
warnings.filterwarnings("ignore")

In [None]:
#loading dataset
data = pd.read_csv("/kaggle/input/breast-cancer-wisconsin-data/data.csv")

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
#drop the unnecessary columns
data.drop(['Unnamed: 32','id'], inplace = True, axis = 1)

In [None]:
data.columns

<a id = '2'></a><br>


# Data Analysis

In [None]:
ax = sns.countplot(x="diagnosis", data=data)
print(data.diagnosis.value_counts())

In [None]:
#converting the categorical labels to numeric values so the models can be trained on them (Benign = 0, Malignant = 1)
data["diagnosis"] = [1 if i == "M" else 0 for i in data.diagnosis]
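
The list comprehension above sends every value other than "M" to 0, so a missing or mistyped label would silently count as benign. A minimal sketch of a stricter alternative, assuming only the labels "B" and "M" are valid; the `sample` frame and `label_map` name are hypothetical stand-ins, not part of the notebook:

```python
import pandas as pd

# Hypothetical three-row sample standing in for the real `diagnosis` column.
sample = pd.DataFrame({"diagnosis": ["M", "B", "M"]})

# Explicit mapping (Benign = 0, Malignant = 1); any other label becomes NaN.
label_map = {"B": 0, "M": 1}
encoded = sample["diagnosis"].map(label_map)

# Fail fast if a label outside the mapping appears.
assert encoded.notna().all(), "unexpected diagnosis label"
encoded = encoded.astype(int)
print(encoded.tolist())
```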

<a id = '3'></a><br>
    
# Modelling    

In [None]:
#separate the dataset into feature variables (X) and the response variable (y)
X = data.drop('diagnosis', axis = 1)
y = data['diagnosis']

<a id = '4'></a><br>
    
## Train - Test Split    

In [None]:
#train and test splitting of data
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)

In [None]:
print("X_train:" , len(X_train))
print("X_test:" , len(X_test))
print("y_train:" , len(y_train))
print("y_test:" , len(y_test))
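
The split above is purely random, so the share of malignant cases in the train and test parts can differ slightly. A possible variation, not used in this notebook, is to pass `stratify` so both parts keep the same class ratio. The sketch below uses made-up labels (`X_demo`, `y_demo`) to show the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 zeros and 30 ones.
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the 70/30 class ratio.
print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```

In the notebook this would amount to adding `stratify = y` to the existing `train_test_split` call.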

<a id = '5'></a><br>
    
## Standardization   

In [None]:
#standardize the features to zero mean and unit variance; the scaler is fit on the training set only and then applied to the test set

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


In [None]:
#after scaling, the features are on a comparable scale

X_train[:5]

In [None]:
X_train_df = pd.DataFrame(X_train, columns = X.columns)
X_train_df_describe = X_train_df.describe()
X_train_df_describe
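
To make the scaling step concrete, the sketch below reproduces what `StandardScaler` does on a tiny made-up array: it subtracts the training mean and divides by the training standard deviation, then reuses those same statistics for the test rows. The `train` and `test` arrays are hypothetical. Note that `StandardScaler` uses the population standard deviation (`ddof=0`) while pandas `describe()` reports the sample standard deviation (`ddof=1`), so the `std` row in the table above reads slightly above 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three training rows and one test row, two features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# Fit on the training rows only.
scaler = StandardScaler().fit(train)

# Manual equivalent: z = (x - mean) / std, using training statistics.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test = (test - mean) / std

# The test row is scaled with the training mean and std, not its own.
print(np.allclose(scaler.transform(test), manual_test))
```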

<a id = '6'></a><br>

## Random Forest Classifier

In [None]:
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
pred_rfc = rfc.predict(X_test)

In [None]:
#let's see how our model performed

print(classification_report(y_test, pred_rfc))
print(confusion_matrix(y_test, pred_rfc))
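
When reading the confusion matrix, rows are the true classes and columns are the predicted classes, so with Benign = 0 and Malignant = 1 the bottom-left cell counts malignant cases predicted as benign. A small sketch with made-up labels (`y_true`, `y_pred` are hypothetical) shows how the four cells relate to precision and recall for the malignant class:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = benign, 1 = malignant.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall: share of truly malignant cases that were caught.
recall = tp / (tp + fn)
# Precision: share of malignant predictions that were correct.
precision = tp / (tp + fp)
print(tn, fp, fn, tp, recall, precision)
```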

<a id = '7'></a><br>

## SVM Classifier

In [None]:
#SVC is already imported above; the default settings use the RBF kernel
clf = SVC()
clf.fit(X_train, y_train)
pred_clf = clf.predict(X_test)

In [None]:
print(classification_report(y_test, pred_clf))
print(confusion_matrix(y_test, pred_clf))

<a id = '8'></a><br>

## Neural Network

In [None]:
mlpc = MLPClassifier(hidden_layer_sizes=(11,11,11),max_iter=500)
mlpc.fit(X_train, y_train)
pred_mlpc = mlpc.predict(X_test)

In [None]:
print(classification_report(y_test, pred_mlpc))
print(confusion_matrix(y_test, pred_mlpc))

<a id = '9'></a><br>

# Accuracy Scores

In [None]:
from sklearn.metrics import accuracy_score

#accuracy of each model on the test set
acc_rfc = accuracy_score(y_test, pred_rfc)
acc_clf = accuracy_score(y_test, pred_clf)
acc_mlpc = accuracy_score(y_test, pred_mlpc)

print("Random Forest Classifier:", acc_rfc)
print("SVM Classifier:", acc_clf)
print("Neural Network:", acc_mlpc)
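
Accuracy here is simply the fraction of test samples whose predicted label matches the true label. The sketch below checks this on made-up labels (`y_true` and `y_pred` are hypothetical):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 of the 8 predictions are correct.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of matching positions.
manual_accuracy = (y_true == y_pred).mean()
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)
```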

<a id = '10'></a><br>

# Result

Here, we see that the SVM and the Neural Network achieved the same accuracy, and both performed better than the Random Forest Classifier on this dataset.
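
This comparison rests on a single train/test split, and the Random Forest and the Neural Network are trained without a fixed `random_state`, so the ranking may shift between runs. One way to check how stable it is would be k-fold cross-validation. The sketch below is an assumption-laden illustration rather than part of the original analysis: it uses synthetic data from `make_classification` as a stand-in, and the names `X_demo`, `y_demo` and `model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; in the notebook, the unscaled X and y would be used.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, random_state=42
)

# The pipeline refits the scaler inside every fold, so the held-out
# fold never influences the scaling statistics.
model = make_pipeline(StandardScaler(), SVC())

# Five stratified folds, one accuracy score per fold.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print("mean accuracy:", scores.mean())
print("std across folds:", scores.std())
```

Swapping `SVC()` for `RandomForestClassifier(n_estimators=200)` or `MLPClassifier(hidden_layer_sizes=(11,11,11), max_iter=500)` would give comparable per-fold scores for the other two models.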