# Breast Cancer Wisconsin (Diagnostic) Data Set

## Importing libraries and loading dataset

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

# loading dataset
df = pd.read_csv("../input/breast-cancer-wisconsin-data/data.csv")

display(df)

In [None]:
df.info()

The dataset has 569 rows and 33 columns. The first column is ID and the second is DIAGNOSIS, which takes two values: benign (B) and malignant (M). The next 30 columns are features that describe characteristics of the cell nuclei. The last column, `Unnamed: 32`, contains only null values, so it will be dropped.

**Here, DIAGNOSIS will be the target variable.**

In [None]:
# removing the last column which only contains NaN values
df = df.drop(["Unnamed: 32"], axis=1)
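
Hard-coding the column name works for this file, but the same cleanup can be written without naming the column. The sketch below uses a small made-up frame (`toy`, with invented values) to show `DataFrame.dropna(axis=1, how="all")`, which drops every column that contains only missing values. It illustrates an alternative, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) mimicking the layout:
# an id, the target, one feature, and an all-NaN trailing column
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, np.nan],
    "Unnamed: 32": [np.nan, np.nan, np.nan],
})

# how="all" drops only columns where every value is missing;
# radius_mean has a single NaN, so it is kept
cleaned = toy.dropna(axis=1, how="all")
print(list(cleaned.columns))
```

A column with only some missing values survives this call, so it does not silently discard partially filled features.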

## Handling Outliers

A pairplot visualises pairwise relationships in a dataset. Here, each column is plotted against the diagnosis, five columns at a time.

In [None]:
for i in range(0, len(df.columns), 5):
    sns.pairplot(data=df, x_vars=df.columns[i:i+5], y_vars=['diagnosis'])

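It is clear that there are outliers present in the data (e.g. in AREA_SE, CONCAVITY_SE and PERIMETER_SE).

The outliers will be removed using IQR scores: a row is dropped if any of its values lies more than 1.5 × IQR below the lower quantile (q1) or above the upper quantile (q3).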
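
As a minimal sketch of the IQR rule, the example below applies it to a single made-up column (the values in `toy` are invented for illustration). It assumes the conventional quartiles, i.e. the 25th and 75th percentiles, and the usual 1.5 × IQR fences.

```python
import pandas as pd

# Toy feature values (invented); 95.0 is an obvious outlier
toy = pd.DataFrame({"area_se": [10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 95.0]})

q1 = toy.quantile(0.25)   # first quartile per column
q3 = toy.quantile(0.75)   # third quartile per column
iqr = q3 - q1

lower = q1 - 1.5 * iqr    # lower fence
upper = q3 + 1.5 * iqr    # upper fence

# A row is flagged if any of its columns falls outside [lower, upper]
mask = ((toy < lower) | (toy > upper)).any(axis=1)
filtered = toy[~mask]
print(filtered["area_se"].tolist())
```

Because the mask uses `any(axis=1)`, a single extreme value in any column removes the whole row, so with many feature columns the number of dropped rows can grow quickly.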

In [None]:
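# calculating IQR scores for each numeric feature column
# (id is an identifier and diagnosis is non-numeric, so both are excluded)
num = df.select_dtypes("number").drop(columns=["id"])
q1 = num.quantile(0.25)
q3 = num.quantile(0.75)
iqr = q3 - q1

# filtering out rows with a value outside [q1 - 1.5*IQR, q3 + 1.5*IQR] in any feature
df = df[~((num < (q1 - 1.5 * iqr)) | (num > (q3 + 1.5 * iqr))).any(axis=1)]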

# plotting the pairplots again
for i in range(0, len(df.columns), 5):
    sns.pairplot(data = df, x_vars = df.columns[i:i+5], y_vars = ['diagnosis'])

Now that the dataset looks much cleaner, we can build our model.

## Decision Tree

To create the model, the data will be split into two sets.

> Training set - 90%

> Testing set - 10%
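
The 90/10 split can be sketched on made-up data as follows. `random_state` and `stratify` are assumptions added here for reproducibility and to keep the class ratio equal in both sets; the notebook's own split uses neither.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (invented): 100 samples, 3 features, 60 benign and 40 malignant labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array(["B"] * 60 + ["M"] * 40)

# test_size=0.1 holds out 10% of the rows for testing;
# random_state fixes the shuffle, stratify=y preserves the B/M ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0, stratify=y
)
print(len(X_train), len(X_test))
```

Without a fixed `random_state`, every run produces a different split, so the reported accuracy will vary from run to run.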

In [None]:
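diagnosis = df['diagnosis']
# id is an identifier, not a characteristic of the cell nuclei, so it is not used as a feature
features = df.drop(['id', 'diagnosis'], axis=1)

# splitting the data into training (90%) and testing (10%) sets
features_train, features_test, labels_train, labels_test = train_test_split(features, diagnosis, test_size=0.1)

# creating the classifier; a node is split only if it holds at least 10 samples
clf = tree.DecisionTreeClassifier(min_samples_split=10)
clf.fit(features_train, labels_train)

pred = clf.predict(features_test)

# accuracy_score expects the true labels first, then the predictions
print("Accuracy =", accuracy_score(labels_test, pred) * 100, "%")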
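
To make the reported number concrete, the sketch below shows what `accuracy_score` computes: the fraction of predictions that match the true labels. The label lists are invented for illustration and are not output from the model above.

```python
from sklearn.metrics import accuracy_score

# Invented labels, for illustration only
y_true = ["B", "M", "B", "B", "M", "B", "M", "B"]
y_pred = ["B", "M", "B", "M", "M", "B", "B", "B"]

# Library call: true labels first, predictions second
acc = accuracy_score(y_true, y_pred)

# Same quantity by hand: matching pairs divided by the number of samples
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc, manual)
```

With a test set of N samples, each misclassified sample changes the accuracy by 1/N, so a small 10% test set gives a fairly coarse estimate.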