# Machine Learning Algorithms on the Breast Cancer Dataset

In this notebook, I use two types of classification models:

* Random Forest Classifier
* Support Vector Classifier

The goal is to check which of the two achieves the better accuracy score.

## Import Libraries

In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
%matplotlib inline 

In [None]:
data = pd.read_csv('../input/breast-cancer-wisconsin-data/data.csv')
data.head()

## Exploratory Data Analysis 

In [None]:
data.columns

In [None]:
data.describe()

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data = data.set_index('id')
data.drop('Unnamed: 32', axis=1, inplace=True)

In [None]:
data.head()

In [None]:
data.shape

## Data Visualization

In [None]:
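sns.countplot(x='diagnosis', hue='diagnosis', data=data, palette='RdBu_r')
# Look up counts by label: value_counts() sorts by frequency, so positional unpacking can swap the classes
counts = data.diagnosis.value_counts()
B, M = counts['B'], counts['M']
print('Number of Benign: ', B)
print('Number of Malignant: ', M)
print('Percentage of Benign: ', B/(B+M)*100)
print('Percentage of Malignant: ', M/(B+M)*100)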

In [None]:
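plt.figure(figsize=(20,14))
# diagnosis is still a string column here, so restrict the correlation to numeric columns (numeric_only needs pandas >= 1.5)
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='viridis')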


In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
encoder = LabelEncoder()
data.diagnosis = encoder.fit_transform(data.diagnosis)
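
When reading the metrics later on, it helps to know which integer each diagnosis label was mapped to. The sketch below uses a small hypothetical list of labels (not the notebook's data) to show that `LabelEncoder` assigns codes in sorted label order, so `B` becomes 0 and `M` becomes 1. The names `demo_encoder` and `label_mapping` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the diagnosis column
demo_labels = ['M', 'B', 'B', 'M']

demo_encoder = LabelEncoder()
encoded = demo_encoder.fit_transform(demo_labels)

# classes_ holds the original labels in sorted order; their positions are the integer codes
label_mapping = {
    str(label): int(code)
    for code, label in enumerate(demo_encoder.classes_)
}

print(encoded)
print(label_mapping)
```

On the notebook's fitted `encoder`, the same information should be available through `encoder.classes_`.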

In [None]:
data.head()

## Train Test Split

**In this section, we divide the data into training and test sets.**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = data.drop('diagnosis',axis=1)
y = data['diagnosis']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
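
Because the two diagnosis classes are not equally frequent, one optional refinement is a stratified split, which keeps the class proportions the same in both subsets. The sketch below illustrates the idea on hypothetical toy data (30% positive class) and is not part of the original analysis. The `_demo` names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 3 features, 30% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo asks for the same class ratio in the train and test subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print('Positive share in train:', y_tr.mean())
print('Positive share in test: ', y_te.mean())
```

In this notebook, the equivalent change would be passing `stratify=y` to the existing `train_test_split` call.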

## Standardizing the Data

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
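X_train = scaler.fit_transform(X_train)
# Reuse the statistics learned from the training set; refitting on the test set would scale it inconsistently
X_test = scaler.transform(X_test)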
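
A common convention is to learn the scaling statistics from the training data only and then apply them unchanged to the test data, so that no information from the test set leaks into preprocessing. The sketch below shows this on tiny hypothetical arrays. `demo_scaler` and the `_demo` arrays are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[2.0], [4.0]])

demo_scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data
X_train_scaled = demo_scaler.fit_transform(X_train_demo)

# transform reuses those training statistics on the test data
X_test_scaled = demo_scaler.transform(X_test_demo)

print('Learned mean:', demo_scaler.mean_)
print('Scaled test data:', X_test_scaled.ravel())
```

A test value equal to the training mean maps to 0, which would not be guaranteed if the scaler were refit on the test set.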

# Creating Random Forest Classifier Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators = 200)

**The fit() method takes the training data as arguments: a single array (the features) for unsupervised learning, or two arrays (the features and the labels) for supervised learning.**

In [None]:
rfc.fit(X_train, y_train)
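
To make the one-array versus two-array distinction concrete, here is a minimal sketch on hypothetical toy data. `KMeans` is used only as an example of an unsupervised estimator and does not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: two well-separated groups of points
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
y_toy = np.array([0, 0, 1, 1])

# Unsupervised: fit() receives only the feature array
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# Supervised: fit() receives the feature array and the label array
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

print('Cluster assignments:', kmeans.labels_)
print('Forest predictions: ', forest.predict(X_toy))
```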


## Predictions and Evaluations of RFC

In [None]:
rfc_prediction = rfc.predict(X_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

**Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations.**

In [None]:
print('Accuracy Score:\n')
print(accuracy_score(y_test, rfc_prediction))
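
The ratio described above can be checked by hand. The sketch below uses small hypothetical label arrays (not the notebook's predictions) to compute the share of matching entries and compare it with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy = number of correct predictions / total number of predictions
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
library_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print('Manual accuracy: ', manual_accuracy)
print('accuracy_score:  ', library_accuracy)
```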

In [None]:
print('Confusion Matrix:\n')
print(confusion_matrix(y_test, rfc_prediction))

In [None]:
print('Classification Report:\n')
print(classification_report(y_test, rfc_prediction))
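
The precision and recall values in the classification report can be derived from the confusion matrix. The sketch below shows the relationship on hypothetical binary labels, assuming the positive class is encoded as 1 (as `LabelEncoder` does for `M` in this dataset). The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = positive class)
y_true_cm = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_cm = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For 0/1 labels, rows are true classes and columns are predicted classes,
# so ravel() yields: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true_cm, y_pred_cm).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print('TN, FP, FN, TP:', tn, fp, fn, tp)
print('Precision:', precision)
print('Recall:   ', recall)
```

For a diagnosis task, the false negatives (malignant cases predicted as benign) are usually the count to watch most closely.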

# Train the Support Vector Classifier

In [None]:
from sklearn.svm import SVC

In [None]:
svc = SVC()

In [None]:
svc.fit(X_train, y_train)

## Predictions and Evaluations of SVC

In [None]:
svc_prediction = svc.predict(X_test)

In [None]:
print('Accuracy Score:\n')
print(accuracy_score(y_test, svc_prediction))

In [None]:
print('Confusion Matrix:\n')
print(confusion_matrix(y_test, svc_prediction))

In [None]:
print('Classification Report:\n')
print(classification_report(y_test, svc_prediction))

**In this case, the Support Vector Classifier achieves the best accuracy, 98%, compared with 96% for the Random Forest Classifier.**
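
A single 80/20 split gives only one estimate of each model's accuracy, so a gap of a couple of percentage points may not be stable. As a rough cross-check, the sketch below runs 5-fold cross-validation on both models. It assumes that scikit-learn's bundled `load_breast_cancer` dataset is an acceptable stand-in for the Kaggle CSV used in this notebook. The pipeline and variable names are illustrative and not part of the original analysis.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the Kaggle CSV
X_cv, y_cv = load_breast_cancer(return_X_y=True)

models = {
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=0),
    # The pipeline refits the scaler inside each fold, using that fold's training part only
    'SVC': make_pipeline(StandardScaler(), SVC()),
}

cv_scores = {}
for name, model in models.items():
    cv_scores[name] = cross_val_score(model, X_cv, y_cv, cv=5, scoring='accuracy')
    print(f'{name}: mean accuracy {cv_scores[name].mean():.3f} '
          f'(std {cv_scores[name].std():.3f})')
```

If the fold-to-fold spread is as large as the difference between the two means, the ranking of the models should be treated as tentative.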