### Your task is to create a new notebook
Perform some data classification tasks on the “breast_cancer” data used in Task 5. In this file you need
to create cells to evaluate the performance of the following classifiers:
    SVC(kernel='linear', C=1),
    SVC(kernel='rbf', C=1, gamma = 'auto'),
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    MLPClassifier(max_iter=1000),
    GaussianNB(),
    RandomForestClassifier(n_estimators=10),
    AdaBoostClassifier()
Use 10-fold cross-validation; that is, cv = 10 (which for classifiers uses StratifiedKFold by default).
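
As a quick check of the note on `cv = 10`, the sketch below compares an integer `cv` with an explicit `StratifiedKFold(n_splits=10)`. `GaussianNB` is used only because it is fast and deterministic; the variable names are illustrative and not part of the task.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
clf = GaussianNB()

# Integer cv: for a classifier with a class-label target, scikit-learn
# builds a non-shuffled StratifiedKFold internally.
scores_int = cross_val_score(clf, X, y, cv=10)

# The same splitter spelled out explicitly.
scores_skf = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))

same_scores = np.allclose(scores_int, scores_skf)
print("Identical fold scores:", same_scores)
```

Passing the splitter explicitly is mainly useful when shuffling or a fixed `random_state` is wanted, which the integer form does not offer.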

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn import svm
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import mutual_info_classif
import numpy as np 
import pandas as pd

In [2]:
combined_data = load_breast_cancer()
print(combined_data.target_names)
data = combined_data.data
target = combined_data.target
print(data.shape)

['malignant' 'benign']
(569, 30)


### Step 1. First calculate the Mutual Information for each feature, rank the features and print the ranked (FeatureMI) list. Select the top 5 and top 3 features and check them by printing the top features.

In [3]:
mi=mutual_info_classif(data,target,discrete_features='auto',random_state=0)
fransk=zip(combined_data.feature_names,mi)
for feat,m in fransk:
    print(feat,'---',m)

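feature_ranks=np.vstack((combined_data.feature_names,mi))
feature_ranks=feature_ranks.transpose()
# Sort on the numeric MI scores (highest first). The stacked array holds strings,
# so sorting its second column would compare the scores as text.
feature_ranks=feature_ranks[np.argsort(mi)[::-1]]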

# select top 5 and top 3 features
top_5=feature_ranks[0:5,:]
top_3=feature_ranks[0:3,:]
top_5_features=feature_ranks[0:5,0]
top_3_features=feature_ranks[0:3,0]
print("Top five features:",top_5)
print("Top three features:",top_3)

mean radius --- 0.36119904808342507
mean texture --- 0.09493724540317938
mean perimeter --- 0.40293624681090523
mean area --- 0.35946772186386133
mean smoothness --- 0.07587963575822343
mean compactness --- 0.21617790258088188
mean concavity --- 0.3746368551183623
mean concave points --- 0.43741858933158806
mean symmetry --- 0.0692410004097348
mean fractal dimension --- 0.009907323993580075
radius error --- 0.24889281087047932
texture error --- 0.0013417123312922108
perimeter error --- 0.27671518250382365
area error --- 0.3406452844070458
smoothness error --- 0.015418390136288318
compactness error --- 0.0751634927784961
concavity error --- 0.1145094894650085
concave points error --- 0.12667234778130054
symmetry error --- 0.018126972190404045
fractal dimension error --- 0.03812227422895509
worst radius --- 0.45420269445248374
worst texture --- 0.11890400769933973
worst perimeter --- 0.4733873996275295
worst area --- 0.46328809523445336
worst smoothness --- 0.10001764577989447
worst comp
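
The ranking step can also be sketched with a pandas Series, which keeps the scores numeric and attaches the feature names as the index. This is only an alternative under the same `mutual_info_classif` settings as above; the names `ranked`, `top_5_names` and `top_3_names` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

bunch = load_breast_cancer()
mi = mutual_info_classif(bunch.data, bunch.target, random_state=0)

# Index the MI scores by feature name and sort from highest to lowest.
ranked = pd.Series(mi, index=bunch.feature_names).sort_values(ascending=False)

# The first entries of the sorted index are the top-ranked features.
top_5_names = list(ranked.index[:5])
top_3_names = list(ranked.index[:3])

print(ranked.head(5))
```

The resulting name lists can be used directly to select columns from a DataFrame built with the same feature names.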

### Step 2. Apply all the above classifiers to all the data and to the data restricted to the top 5 and top 3 features.
Print the test accuracy for each classifier as a table below by creating a dataframe with corresponding
columns. Change the default parameter settings to see if you can improve the accuracy.
In [4]:
df=pd.DataFrame(data=data,columns=combined_data.feature_names)
df_top_5=df[top_5_features]
df_top_3 = df[top_3_features]

classifiers=['SVM(kernel=linear)','SVM(kernel=rbf)','KNeighbors','DecisionTree','MLP','GaussianNB','RandomForest','AdaBoost']
df_results=pd.DataFrame(data=classifiers,columns=['Classifier'])
mean_accuracies_all=[]
mean_accuracies_top_5=[]
mean_accuracies_top_3=[]

# SVC(kernel='linear', C=1),
clf = svm.SVC(kernel='linear', C=1)
scores_all = cross_val_score(clf, data, target, cv=10)
scores_top_5 = cross_val_score(clf, df_top_5, target, cv=10)
scores_top_3 = cross_val_score(clf, df_top_3, target, cv=10)
print("Results for SVM with linear kernel")
print("==============================================================")
print("All features - Each fold accuracy (StratifiedKFold)", scores_all)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_all.mean(), scores_all.std() * 2))
print("Top 5 features - Each fold accuracy (StratifiedKFold)", scores_top_5)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_5.mean(), scores_top_5.std() * 2))
print("Top 3 features - Each fold accuracy (StratifiedKFold)", scores_top_3)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_3.mean(), scores_top_3.std() * 2))
mean_accuracies_all.append(scores_all.mean())
mean_accuracies_top_5.append(scores_top_5.mean())
mean_accuracies_top_3.append(scores_top_3.mean())
print("---------------------------------------------------------------")

# SVC(kernel='rbf', C=1, gamma = 'auto')
clf = svm.SVC(kernel='rbf', C=1, gamma = 'auto')
scores_all = cross_val_score(clf, data, target, cv=10)
scores_top_5 = cross_val_score(clf, df_top_5, target, cv=10)
scores_top_3 = cross_val_score(clf, df_top_3, target, cv=10)
print("Results for SVM with RBF kernel")
print("==============================================================")
print("All features - Each fold accuracy (StratifiedKFold)", scores_all)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_all.mean(), scores_all.std() * 2))
print("Top 5 features - Each fold accuracy (StratifiedKFold)", scores_top_5)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_5.mean(), scores_top_5.std() * 2))
print("Top 3 features - Each fold accuracy (StratifiedKFold)", scores_top_3)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_3.mean(), scores_top_3.std() * 2))
mean_accuracies_all.append(scores_all.mean())
mean_accuracies_top_5.append(scores_top_5.mean())
mean_accuracies_top_3.append(scores_top_3.mean())
print("---------------------------------------------------------------")

# KNeighborsClassifier(),
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=2)
scores_all = cross_val_score(clf, data, target, cv=10)
scores_top_5 = cross_val_score(clf, df_top_5, target, cv=10)
scores_top_3 = cross_val_score(clf, df_top_3, target, cv=10)
print("Results for KNN")
print("==============================================================")
print("All features - Each fold accuracy (StratifiedKFold)", scores_all)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_all.mean(), scores_all.std() * 2))
print("Top 5 features - Each fold accuracy (StratifiedKFold)", scores_top_5)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_5.mean(), scores_top_5.std() * 2))
print("Top 3 features - Each fold accuracy (StratifiedKFold)", scores_top_3)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_3.mean(), scores_top_3.std() * 2))
mean_accuracies_all.append(scores_all.mean())
mean_accuracies_top_5.append(scores_top_5.mean())
mean_accuracies_top_3.append(scores_top_3.mean())
print("---------------------------------------------------------------")

# DecisionTreeClassifier(),
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
scores_all = cross_val_score(clf, data, target, cv=10)
scores_top_5 = cross_val_score(clf, df_top_5, target, cv=10)
scores_top_3 = cross_val_score(clf, df_top_3, target, cv=10)
print("Results for Decision Tree")
print("==============================================================")
print("All features - Each fold accuracy (StratifiedKFold)", scores_all)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_all.mean(), scores_all.std() * 2))
print("Top 5 features - Each fold accuracy (StratifiedKFold)", scores_top_5)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_5.mean(), scores_top_5.std() * 2))
print("Top 3 features - Each fold accuracy (StratifiedKFold)", scores_top_3)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_3.mean(), scores_top_3.std() * 2))
mean_accuracies_all.append(scores_all.mean())
mean_accuracies_top_5.append(scores_top_5.mean())
mean_accuracies_top_3.append(scores_top_3.mean())
print("---------------------------------------------------------------")

# MLPClassifier(max_iter=1000),
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=500)
scores_all = cross_val_score(clf, data, target, cv=10)
scores_top_5 = cross_val_score(clf, df_top_5, target, cv=10)
scores_top_3 = cross_val_score(clf, df_top_3, target, cv=10)
print("Results for MLP Classifier")
print("==============================================================")
print("All features - Each fold accuracy (StratifiedKFold)", scores_all)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_all.mean(), scores_all.std() * 2))
print("Top 5 features - Each fold accuracy (StratifiedKFold)", scores_top_5)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_5.mean(), scores_top_5.std() * 2))
print("Top 3 features - Each fold accuracy (StratifiedKFold)", scores_top_3)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_3.mean(), scores_top_3.std() * 2))
mean_accuracies_all.append(scores_all.mean())
mean_accuracies_top_5.append(scores_top_5.mean())
mean_accuracies_top_3.append(scores_top_3.mean())
print("---------------------------------------------------------------")

# GaussianNB()
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
scores_all = cross_val_score(clf, data, target, cv=10)
scores_top_5 = cross_val_score(clf, df_top_5, target, cv=10)
scores_top_3 = cross_val_score(clf, df_top_3, target, cv=10)
print("Results for Gaussian NB")
print("==============================================================")
print("All features - Each fold accuracy (StratifiedKFold)", scores_all)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_all.mean(), scores_all.std() * 2))
print("Top 5 features - Each fold accuracy (StratifiedKFold)", scores_top_5)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_5.mean(), scores_top_5.std() * 2))
print("Top 3 features - Each fold accuracy (StratifiedKFold)", scores_top_3)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_3.mean(), scores_top_3.std() * 2))
mean_accuracies_all.append(scores_all.mean())
mean_accuracies_top_5.append(scores_top_5.mean())
mean_accuracies_top_3.append(scores_top_3.mean())
print("---------------------------------------------------------------")

# RandomForestClassifier(n_estimators=10),
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, random_state=0)
scores_all = cross_val_score(clf, data, target, cv=10)
scores_top_5 = cross_val_score(clf, df_top_5, target, cv=10)
scores_top_3 = cross_val_score(clf, df_top_3, target, cv=10)
print("Results for Random Forest")
print("==============================================================")
print("All features - Each fold accuracy (StratifiedKFold)", scores_all)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_all.mean(), scores_all.std() * 2))
print("Top 5 features - Each fold accuracy (StratifiedKFold)", scores_top_5)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_5.mean(), scores_top_5.std() * 2))
print("Top 3 features - Each fold accuracy (StratifiedKFold)", scores_top_3)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_3.mean(), scores_top_3.std() * 2))
mean_accuracies_all.append(scores_all.mean())
mean_accuracies_top_5.append(scores_top_5.mean())
mean_accuracies_top_3.append(scores_top_3.mean())
print("---------------------------------------------------------------")

# AdaBoostClassifier()
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=10, random_state=0)
scores_all = cross_val_score(clf, data, target, cv=10)
scores_top_5 = cross_val_score(clf, df_top_5, target, cv=10)
scores_top_3 = cross_val_score(clf, df_top_3, target, cv=10)
print("Results for AdaBoostClassifier")
print("==============================================================")
print("All features - Each fold accuracy (StratifiedKFold)", scores_all)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_all.mean(), scores_all.std() * 2))
print("Top 5 features - Each fold accuracy (StratifiedKFold)", scores_top_5)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_5.mean(), scores_top_5.std() * 2))
print("Top 3 features - Each fold accuracy (StratifiedKFold)", scores_top_3)
print("Mean Accuracy: %0.2f (+/- %0.2f)" % (scores_top_3.mean(), scores_top_3.std() * 2))
mean_accuracies_all.append(scores_all.mean())
mean_accuracies_top_5.append(scores_top_5.mean())
mean_accuracies_top_3.append(scores_top_3.mean())
print("---------------------------------------------------------------")

df_results['Accuracy_all_features'] = np.asarray(mean_accuracies_all)
df_results['Accuracy_top5_features'] = np.asarray(mean_accuracies_top_5)
df_results['Accuracy_top3_features'] = np.asarray(mean_accuracies_top_3)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df_results)

Results for SVM with linear kernel
All features - Each fold accuracy (StratifiedKFold) [0.98245614 0.92982456 0.92982456 0.94736842 0.96491228 0.98245614
 0.92982456 0.94736842 0.96491228 0.96428571]
Mean Accuracy: 0.95 (+/- 0.04)
Top 5 features - Each fold accuracy (StratifiedKFold) [0.92982456 0.87719298 0.87719298 0.9122807  0.96491228 0.89473684
 0.9122807  0.92982456 0.92982456 0.94642857]
Mean Accuracy: 0.92 (+/- 0.05)
Top 3 features - Each fold accuracy (StratifiedKFold) [0.9122807  0.87719298 0.87719298 0.9122807  0.96491228 0.89473684
 0.92982456 0.92982456 0.92982456 0.94642857]
Mean Accuracy: 0.92 (+/- 0.05)
---------------------------------------------------------------
Results for SVM with RBF kernel
All features - Each fold accuracy (StratifiedKFold) [0.61403509 0.61403509 0.63157895 0.63157895 0.63157895 0.63157895
 0.63157895 0.63157895 0.63157895 0.625     ]
Mean Accuracy: 0.63 (+/- 0.01)
Top 5 features - Each fold accuracy (StratifiedKFold) [0.66666667 0.64912281 0.71
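
The RBF-kernel accuracy of about 0.63 on all features is close to the share of the majority class (357 of the 569 samples are benign). One plausible explanation, which the output here does not confirm, is that the unscaled features make the kernel with `gamma = 'auto'` nearly uninformative. The sketch below tests that assumption by standardizing the features inside a pipeline; the pipeline and variable names are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Same RBF settings as in the task, applied to the raw features.
raw_rbf = SVC(kernel='rbf', C=1, gamma='auto')

# Same classifier, but each training fold is standardized first.
# Putting the scaler in the pipeline keeps test folds out of the fit.
scaled_rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, gamma='auto'))

raw_scores = cross_val_score(raw_rbf, X, y, cv=10)
scaled_scores = cross_val_score(scaled_rbf, X, y, cv=10)

print("Unscaled RBF mean accuracy: %0.2f" % raw_scores.mean())
print("Scaled RBF mean accuracy:   %0.2f" % scaled_scores.mean())
```

If scaling helps here, the same pipeline pattern could be tried with the other distance-based or gradient-based classifiers in the list, such as KNeighborsClassifier and MLPClassifier.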

### Step 3. Conclusions. Analyse and write down your main observations (as Markdown).
Which classifier is the best for prediction on this data?

From the summarized results, we can see that the SVM classifier with a linear kernel performs best in all three cases. As the number of features is reduced, the general trend is that the accuracy decreases.