# Mini Project: Breast Cancer Campaign
Classify a patient's diagnosis from the features collected in the Wisconsin Breast Cancer dataset.

You are required to:
- Select the most important features in the Breast Cancer dataset.
- Train multiple classifiers on the dataset to predict the diagnosis class.
- Achieve an acceptable accuracy score.

In [114]:
import pandas as pd

df = pd.read_csv("data_refined.csv")

df

Unnamed: 0,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,diagnosis
0,1.097064,-2.073335,1.269934,0.984375,1.568466,3.283515,2.652874,2.532475,1
1,1.829821,-0.353632,1.685955,1.908708,-0.826962,-0.487072,-0.023846,0.548144,1
2,1.579888,0.456187,1.566503,1.558884,0.942210,1.052926,1.363478,2.037231,1
3,-0.768909,0.253732,-0.592687,-0.764464,3.283553,3.402909,1.915897,1.451707,1
4,1.750297,-1.151816,1.776573,1.826229,0.280372,0.539340,1.371011,1.428493,1
...,...,...,...,...,...,...,...,...,...
564,2.110995,0.721473,2.060786,2.343856,1.041842,0.219060,1.947285,2.320965,1
565,1.704854,2.085134,1.615931,1.723842,0.102458,-0.017833,0.693043,1.263669,1
566,0.702284,2.045574,0.672676,0.577953,-0.840484,-0.038680,0.046588,0.105777,1
567,1.838341,2.336457,1.982524,1.735218,1.525767,3.272144,3.296944,2.658866,1


# Feature Selection

Selecting the most informative features is a key step in training a classifier, and it can be done in many ways.

One of the simplest approaches is to keep the features that correlate most strongly with the target.

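The label in this case is the 'diagnosis' column.

In the original dataset the diagnosis has two distinct values:
- M: Malignant Tumor
- B: Benign Tumor

In data_refined.csv the column is numerically encoded, with 1 marking a malignant tumor (and, presumably, 0 a benign one).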

Calculate the correlation of each feature with the target label.

Choose the features whose correlation exceeds a chosen threshold for training.
Output a list of the important feature names.

In [115]:
correlations = df.corr()["diagnosis"].drop("diagnosis")
correlations

radius_mean            0.730029
texture_mean           0.415185
perimeter_mean         0.742636
area_mean              0.708984
smoothness_mean        0.358560
compactness_mean       0.596534
concavity_mean         0.696360
concave points_mean    0.776614
Name: diagnosis, dtype: float64

In [116]:
# threshold = 0.7
threshold = 0  # set to 0 to keep all features, which yielded higher accuracy
important_features = []
for name, value in correlations.items():
    if value > threshold:
        print(name, "is above", threshold, "so is considered important")
        important_features.append(name)

print(important_features)

radius_mean is above 0 so is considered important
texture_mean is above 0 so is considered important
perimeter_mean is above 0 so is considered important
area_mean is above 0 so is considered important
smoothness_mean is above 0 so is considered important
compactness_mean is above 0 so is considered important
concavity_mean is above 0 so is considered important
concave points_mean is above 0 so is considered important
['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean']
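
The cell above compares the raw correlation against the threshold, which works here because every feature correlates positively with the label. As a hedged sketch of a slightly more general variant, the helper below (the name `select_by_correlation` and the toy data are made up for illustration) filters on the absolute correlation, so that a strongly negative feature would also be kept:

```python
import pandas as pd


def select_by_correlation(frame, target, threshold):
    """Return the feature names whose |Pearson correlation| with `target` exceeds `threshold`."""
    corr = frame.corr()[target].drop(target)
    return corr[corr.abs() > threshold].index.tolist()


# Toy data, invented for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "strong_pos": [1, 2, 3, 4, 5, 6],    # rises with the label
    "strong_neg": [6, 5, 4, 3, 2, 1],    # falls with the label
    "weak": [1, -1, 1, -1, 1, -1],       # only weakly related to the label
    "label": [0, 0, 0, 1, 1, 1],
})

selected = select_by_correlation(toy, "label", threshold=0.7)
print(selected)
```

On the notebook's data the same call would look like `select_by_correlation(df, "diagnosis", 0.7)`.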


# Splitting the Data

Split your data as follows:
- 80% training set
- 10% validation set
- 10% test set

In [117]:
from sklearn.model_selection import train_test_split

X = df[important_features]
y = df["diagnosis"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

X_validate, X_test, y_validate, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=0)

print(len(X_train), len(X_test), len(X_validate))

455 57 57
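
With only 57 rows in each of the validation and test sets, the class balance can drift between splits. One possible refinement, sketched here as an assumption rather than part of the exercise, is to pass `stratify` to `train_test_split`. The sketch uses scikit-learn's bundled copy of the dataset (`load_breast_cancer`) so it runs without `data_refined.csv`; note that this copy has 30 unscaled features and encodes malignant as 0, unlike the file used in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Bundled copy of the Wisconsin dataset (569 rows, 0 = malignant, 1 = benign).
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# First split: 80% training, 20% held out, preserving the class ratio.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Second split: halve the held-out part into validation and test sets.
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))
print(round(y_demo.mean(), 3), round(y_tr.mean(), 3))  # class ratio overall vs. training set
```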


# Training Classifiers

Train KNN, random forest, and support vector classifier (SVC) models on your data.

Train each model on both the full feature set and the reduced feature set.

Get accuracy scores and confusion matrices for both. You need a minimum accuracy score of 94%.

Compare the results.
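
A comparison of the two feature sets could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's own code: it uses scikit-learn's bundled copy of the dataset (30 unscaled features and a `target` column, with malignant encoded as 0), a 0.7 absolute-correlation cutoff, and a linear SVC behind a `StandardScaler`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Features plus a "target" column in one DataFrame.
frame = load_breast_cancer(as_frame=True).frame

train, validate = train_test_split(
    frame, test_size=0.2, random_state=0, stratify=frame["target"]
)

# Pick the reduced set from the training rows only, so the validation rows
# do not influence which features are selected.
corr = train.corr()["target"].drop("target")
feature_sets = {
    "full": corr.index.tolist(),
    "reduced": corr[corr.abs() > 0.7].index.tolist(),
}

val_scores = {}
for name, cols in feature_sets.items():
    model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    model.fit(train[cols], train["target"])
    val_scores[name] = model.score(validate[cols], validate["target"])
    print(name, len(cols), "features, validation accuracy =", val_scores[name])
```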

Hint: you need to choose the optimal value for k using cross validation.
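
One way to follow this hint, sketched under assumptions (the candidate range of odd k from 1 to 21, 5-fold cross-validation, and scikit-learn's bundled copy of the dataset are all choices made for this example), is to score each candidate k with `cross_val_score` on the training data only and keep the best one:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Mean 5-fold cross-validation accuracy for each candidate k.
# Scaling sits inside the pipeline so it is re-fit on every fold.
cv_means = {}
for k in range(1, 22, 2):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means[k] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print("best k =", best_k, "mean CV accuracy =", cv_means[best_k])
```

The features in data_refined.csv appear to be standardized already, so the `StandardScaler` step would likely be unnecessary there; it is included because the bundled data is unscaled.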

In [118]:
# Each model is fit on the training set and scored on the validation set.
# y_pred holds that model's test-set predictions; it is overwritten by the
# next model and is not used for the validation scores printed below.

from sklearn.neighbors import KNeighborsClassifier
kNCmodel = KNeighborsClassifier(n_neighbors=3)
kNCmodel.fit(X_train, y_train)
y_pred = kNCmodel.predict(X_test)
score = kNCmodel.score(X_validate, y_validate)
print("KNeighborsClassifier Accuracy =", score)

from sklearn.ensemble import RandomForestClassifier
rFCmodel = RandomForestClassifier(random_state=0, criterion="gini", n_estimators=700)
rFCmodel.fit(X_train, y_train)
y_pred = rFCmodel.predict(X_test)
score = rFCmodel.score(X_validate, y_validate)
print("RandomForestClassifier Accuracy =", score)

from sklearn.svm import SVC
# gamma is ignored by the linear kernel; it only affects rbf, poly and sigmoid.
sVCmodel = SVC(kernel="linear", gamma="auto", C=1.0)
sVCmodel.fit(X_train, y_train)
y_pred = sVCmodel.predict(X_test)
score = sVCmodel.score(X_validate, y_validate)
print("SVC Accuracy =", score)

# I varied the correlation threshold to compare the full feature set with the reduced one,
# e.g. threshold = 0 for all features vs. threshold = 0.7 for the ones I considered important.
# RandomForestClassifier with n_estimators around 600 (700 in the run above) on the entire
# feature set (already refined in the previous mini project) gave the best validation accuracy (98%).



KNeighborsClassifier Accuracy = 0.9298245614035088
RandomForestClassifier Accuracy = 0.9824561403508771
SVC Accuracy = 0.9298245614035088
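
The scores above are validation accuracies, and the task also asks for confusion matrices. A minimal sketch of that step, assuming scikit-learn's bundled copy of the dataset and a smaller forest (`n_estimators=100`) to keep the example quick, could look like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Bundled copy of the dataset: 0 = malignant, 1 = benign.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

# Rows are true classes, columns are predicted classes (in label order 0, 1).
cm = confusion_matrix(y_te, pred)
acc = accuracy_score(y_te, pred)
print(cm)
print("accuracy =", acc)
```

In this notebook the equivalent call would be `confusion_matrix(y_test, y_pred)`, made right after each model's `predict` so that `y_pred` still belongs to that model.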
