# **Breast Tissue Classification Data Set**
Source: UCI Machine Learning Repository

# **Data Set Information:**
The dataset can be used to predict either the original 6 classes or 4 classes obtained by merging the fibro-adenoma, mastopathy and glandular classes, whose discrimination is not important (they cannot be accurately discriminated anyway).

# **Attribute Information:**
I0 Impedivity (ohm) at zero frequency

PA500 phase angle at 500 kHz

HFS high-frequency slope of phase angle

DA impedance distance between spectral ends

AREA area under spectrum

A/DA area normalized by DA

MAX IP maximum of the spectrum

DR distance between I0 and real part of the maximum frequency point

P length of the spectral curve

Class car (carcinoma), fad (fibro-adenoma), mas (mastopathy), gla (glandular), con (connective), adi (adipose)

In [1]:
# Importing libraries
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score,accuracy_score,recall_score,confusion_matrix

In [None]:
# Reading Dataset
dataset=pd.read_excel("/content/BreastTissue.xls",sheet_name="Data",index_col=0)

# Taking a glance at the dataset and class labels; there are no null values in the dataset
print(dataset)
print(dataset.info())
print(dataset.describe())
print(dataset.sample(frac=.6))
print(dataset["Class"].value_counts())

In [None]:
# Merging the 3 classes fibro-adenoma, mastopathy and glandular into one class (kept under the label "gla")
dataset.loc[dataset['Class'] =="fad", 'Class'] = "gla"
dataset.loc[dataset['Class'] =="mas", 'Class'] = "gla"
print(dataset["Class"].value_counts())
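The merge above can also be written with a single mapping, which keeps the rule in one place. This is a minimal sketch on a hypothetical toy frame (`toy`), not the real dataset; the merged label `"fmg"` is an illustrative name chosen here so the combined class is not confused with the original glandular class.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset (assumption)
toy = pd.DataFrame({"Class": ["car", "fad", "mas", "gla", "con", "adi", "fad"]})

# Map the three hard-to-separate classes onto one label.
# "fmg" is an illustrative label, not one used in the original notebook.
merge_map = {"fad": "fmg", "mas": "fmg", "gla": "fmg"}
toy["Class"] = toy["Class"].replace(merge_map)

# Four classes should remain after the merge
print(toy["Class"].value_counts())
```

Using a distinct label for the merged class is a design choice; reusing `"gla"` as the notebook does gives the same model, only the class name differs.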

In [None]:
# Separating the dataset into predictor (X) and dependent (y) variables
X=dataset.iloc[:,1:].values
y=dataset.iloc[:,0].values
print(X)
print(y)

In [None]:
# Performing a stratified 70%-30% train-test split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3,random_state=23,stratify=y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
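The `stratify=y` argument keeps the class proportions roughly equal in the training and test parts, which matters when some classes have few samples. A small sketch of how this could be checked, using synthetic labels (`y_demo`) and random features (`X_demo`) as stand-ins for the real data; the helper `class_shares` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for the real Class column (assumption)
rng = np.random.default_rng(0)
y_demo = np.array(["car"] * 20 + ["gla"] * 50 + ["con"] * 15 + ["adi"] * 15)
X_demo = rng.normal(size=(len(y_demo), 9))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=23, stratify=y_demo
)

def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    values, counts = np.unique(labels, return_counts=True)
    return {str(v): c / counts.sum() for v, c in zip(values, counts)}

shares_train = class_shares(y_tr)
shares_test = class_shares(y_te)
print(shares_train)
print(shares_test)
```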

**Scaling the data**

In [6]:
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training data only, then apply the same transform to the test data
scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

# print(X_train)
# print(X_test)
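Fitting the scaler on the training data and only transforming the test data avoids leaking test statistics into training. One way to make that hard to get wrong is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below assumes synthetic data from `make_classification` (9 features, 4 classes) in place of the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 9 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=120, n_features=9, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=23, stratify=y_demo
)

# The pipeline fits the scaler on the training data only,
# then reuses those statistics when scoring the test data.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(X_tr, y_tr)
test_accuracy = knn_pipeline.score(X_te, y_te)
print(round(test_accuracy, 2))
```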

# **Building a Nearest Neighbour Classifier**

In [7]:
from sklearn.neighbors import KNeighborsClassifier
knnclassifier=KNeighborsClassifier(n_neighbors=5,metric="minkowski",p=2)
knnclassifier.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
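The value `n_neighbors=5` is the library default and may not be the best choice for this data. A hedged sketch of how the number of neighbours could be chosen by cross-validated grid search, again on synthetic stand-in data; the candidate list `candidate_k` is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 9 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=120, n_features=9, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

pipeline = make_pipeline(
    StandardScaler(), KNeighborsClassifier(metric="minkowski", p=2)
)
candidate_k = [1, 3, 5, 7, 9]  # illustrative values only

search = GridSearchCV(
    pipeline,
    param_grid={"kneighborsclassifier__n_neighbors": candidate_k},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 2))
```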

In [None]:
# Predicting the values
predicted=knnclassifier.predict(X_test)
# Building the confusion matrix (rows: true classes, columns: predicted classes)
confusion=confusion_matrix(y_test,predicted)
print('Confusion Matrix is \n')
print(confusion)
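A bare confusion matrix is hard to read without class names. The sketch below shows one way to label rows (true classes) and columns (predicted classes) with pandas; the `y_true` and `y_pred` lists are made-up toy labels, not results from this notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (assumption)
y_true = ["car", "car", "gla", "gla", "gla", "con", "adi", "adi"]
y_pred = ["car", "gla", "gla", "gla", "car", "con", "adi", "con"]

# Fix the class order explicitly so rows and columns are unambiguous
labels = ["adi", "car", "con", "gla"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

cm_table = pd.DataFrame(
    cm,
    index=[f"true_{name}" for name in labels],
    columns=[f"pred_{name}" for name in labels],
)
print(cm_table)
```

The diagonal holds the correctly classified samples, so accuracy is the diagonal sum divided by the total count.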

In [11]:
# Calculating accuracy and macro-averaged precision
print(f'Accuracy score is {round(accuracy_score(y_test,predicted),2)*100} %')
print(f'Precision score is {round(precision_score(y_test,predicted,average="macro"),2)}')

Accuracy score is 81.0 %
Precision score is 0.79
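The precision above uses `average="macro"`, which is the unweighted mean of the per-class precisions, so every class counts equally regardless of size. The following sketch contrasts it with `average="weighted"` on made-up toy labels (not results from this notebook).

```python
from sklearn.metrics import precision_score

# Made-up labels for illustration only (assumption)
y_true = ["car", "car", "gla", "gla", "gla", "con", "adi", "adi"]
y_pred = ["car", "gla", "gla", "gla", "car", "con", "adi", "con"]
labels = ["adi", "car", "con", "gla"]

# Precision per class, in the order given by `labels`
per_class = precision_score(y_true, y_pred, labels=labels, average=None)

# Macro: plain mean of the per-class values
macro = precision_score(y_true, y_pred, average="macro")

# Weighted: mean of the per-class values weighted by each class's support
weighted = precision_score(y_true, y_pred, average="weighted")

print(dict(zip(labels, per_class.round(2))))
print(round(macro, 4), round(weighted, 4))
```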


# **The K-Nearest Neighbours classifier reaches 81% accuracy, which leaves room for improvement. Next we try a Random Forest classifier to see whether it gives better accuracy.**

In [None]:
# Print NumPy arrays with 2 decimal places
np.set_printoptions(precision=2)
from sklearn.ensemble import RandomForestClassifier
randomclassifier=RandomForestClassifier(random_state=34,n_estimators=100)
randomclassifier.fit(X_train,y_train)

In [None]:
# Predicting the values
predicted=randomclassifier.predict(X_test)
# Building the confusion matrix (rows: true classes, columns: predicted classes)
confusion=confusion_matrix(y_test,predicted)
print('Confusion Matrix is \n')
print(confusion)

In [18]:
# Calculating accuracy and macro-averaged precision
print(f'Accuracy score is {round(accuracy_score(y_test,predicted),2)*100} %')
print(f'Precision score is {round(precision_score(y_test,predicted,average="macro"),2)}')

Accuracy score is 88.0 %
Precision score is 0.86


**The Random Forest classifier (ensemble learning) achieves 88% accuracy, compared with 81% for the K-Nearest Neighbours classifier.**
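Both scores come from a single train-test split, so the gap between the two models may partly reflect which samples landed in the test set. A possible follow-up is to compare the models with stratified k-fold cross-validation. The sketch below uses synthetic stand-in data and the same hyperparameters as above; it illustrates the procedure and does not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 9 features, 4 classes
X_demo, y_demo = make_classification(
    n_samples=120, n_features=9, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    # KNN is distance-based, so it is scaled inside each fold
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    # Tree ensembles do not need feature scaling
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=34),
}

cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean={scores.mean():.2f} std={scores.std():.2f}")
```

Reporting the mean and standard deviation over folds gives a sense of how stable the difference between the two classifiers is.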