# Improving the Results

In [87]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
%store -r data

In [72]:
data.head()

Unnamed: 0,Diagnosis,Mean Radius,Mean Texture,Mean Perimeter,Mean Area,Mean Smoothness,Mean Compactness,Mean Concavity,Mean Concave Points,Mean Symmetry,...,Worst Radius,Worst Texture,Worst Perimeter,Worst Area,Worst Smoothness,Worst Compactness,Worst Concavity,Worst Concave Points,Worst Symmetry,Worst Fractal dimension
0,1,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,1,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,1,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,1,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


To try to improve the classification results, we apply several kinds of feature normalization and compare the resulting model accuracies.

In [278]:
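def klasificiraj(train_test_data, algo):
    """Fit a fresh model from `algo` and print its training and test accuracy."""
    X_train, X_test, y_train, y_test = train_test_data
    model = algo()  # `algo` is a zero-argument factory returning an unfitted estimator
    model.fit(X_train, y_train)

    # Predict on contiguous NumPy arrays built from the feature matrices
    y_pred_train = model.predict(np.ascontiguousarray(X_train))
    y_pred_test = model.predict(np.ascontiguousarray(X_test))

    # accuracy_score expects (y_true, y_pred)
    accuracy = accuracy_score(y_train, y_pred_train)
    print(f'Training Accuracy: {round(accuracy*100, 1)}%')

    accuracy = accuracy_score(y_test, y_pred_test)
    print(f'Testing Accuracy: {round(accuracy*100, 1)}%')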
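The comparison tables in the discussion section are typed in by hand from printed output. A minimal sketch of a variant that returns the accuracies instead of printing them could make those tables reproducible. The name `evaluate` is hypothetical, and the synthetic data from `make_classification` is an assumption standing in for the notebook's `data` frame.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate(train_test_data, algo):
    """Fit a fresh model and return (train_accuracy, test_accuracy) as fractions."""
    X_train, X_test, y_train, y_test = train_test_data
    model = algo()
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc


# Synthetic stand-in data (assumption); the notebook's `data` is not available here.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
split = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

train_acc, test_acc = evaluate(split, lambda: LogisticRegression(max_iter=3000))
print(f"Training Accuracy: {train_acc:.1%}")
print(f"Testing Accuracy: {test_acc:.1%}")
```

Returning numbers rather than formatted strings also makes it possible to compute differences between methods directly.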

## Min-Max Normalization

In [279]:
df = data.copy()
columns_to_normalize = df.columns.difference(['Diagnosis'])  # exclude the label column
df_normalized = df.copy()
df_normalized[columns_to_normalize] = (df[columns_to_normalize] - df[columns_to_normalize].min()) / (df[columns_to_normalize].max() - df[columns_to_normalize].min())
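The cell above computes the minimum and maximum over the whole data set before splitting, so the test rows influence the scaling. One common alternative, sketched here under the assumption that scikit-learn's `MinMaxScaler` is acceptable in place of the manual formula, learns the range from the training rows only. The synthetic data is a stand-in, not the notebook's `data`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # min and max learned from training rows only
X_te_scaled = scaler.transform(X_te)      # same range reused; values may leave [0, 1]

print(X_tr_scaled.min(axis=0))
print(X_tr_scaled.max(axis=0))
```

With this approach the test set stays unseen until evaluation, at the cost that some scaled test values can fall slightly outside the unit interval.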


Training and test sets for the normalized data.

In [280]:
X = df_normalized.loc[:, df_normalized.columns != 'Diagnosis']
y = df_normalized.loc[:, 'Diagnosis']

train_test_data = train_test_split(X, y, test_size=0.25, random_state=42)

### Logistic Regression

In [281]:
log_reg = lambda: LogisticRegression(max_iter=3000, random_state=42)
klasificiraj(train_test_data, log_reg)

Training Accuracy: 96.7%
Testing Accuracy: 98.6%


### Support Vector Machine

In [282]:
svm = lambda: SVC(kernel="linear", random_state=42)
klasificiraj(train_test_data, svm)

Training Accuracy: 97.4%
Testing Accuracy: 98.6%


### Random Forest

In [283]:
rnd_forest = lambda: RandomForestClassifier(random_state=42)
klasificiraj(train_test_data, rnd_forest)

Training Accuracy: 100.0%
Testing Accuracy: 96.5%


### Naive Bayes

In [284]:
bayes = lambda: GaussianNB()
klasificiraj(train_test_data, bayes)

Training Accuracy: 93.7%
Testing Accuracy: 95.1%


### Decision Tree

In [285]:
dec_tree = lambda: DecisionTreeClassifier(random_state=42)
klasificiraj(train_test_data, dec_tree)

Training Accuracy: 100.0%
Testing Accuracy: 95.1%


### k-Nearest Neighbors

In [286]:
knn = lambda: KNeighborsClassifier(n_neighbors=3)
klasificiraj(train_test_data, knn)

Training Accuracy: 98.1%
Testing Accuracy: 97.2%


## Z-Score Normalization

In [287]:
df = data.copy()
columns_to_standardize = data.columns.difference(['Diagnosis'])  # exclude the label column
df_standardized = df.copy()
df_standardized[columns_to_standardize] = (df[columns_to_standardize] - df[columns_to_standardize].mean()) / df[columns_to_standardize].std()

X = df_standardized.loc[:, df_standardized.columns != 'Diagnosis']
y = df.loc[:, 'Diagnosis']

train_test_data = train_test_split(X, y, test_size=0.25, random_state=42)
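This section standardizes with pandas, while the PCA section uses `StandardScaler`. The two are not numerically identical: `DataFrame.std()` defaults to the sample standard deviation (`ddof=1`), whereas `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below illustrates the difference on a small made-up frame; the column names and values are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame (made-up values).
frame = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 40.0, 80.0]})

z_pandas = (frame - frame.mean()) / frame.std()            # sample std, ddof=1
z_population = (frame - frame.mean()) / frame.std(ddof=0)   # population std, ddof=0
z_sklearn = StandardScaler().fit_transform(frame)           # also uses ddof=0

print(np.allclose(z_population.to_numpy(), z_sklearn))
print(np.allclose(z_pandas.to_numpy(), z_sklearn))
```

The two variants differ only by the constant factor sqrt((n-1)/n), which is close to 1 for data sets with several hundred rows, so the effect on the classifiers here is likely small.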

### Logistic Regression

In [288]:
log_reg = lambda: LogisticRegression(max_iter=3000, random_state=42)
klasificiraj(train_test_data, log_reg)

Training Accuracy: 98.8%
Testing Accuracy: 97.9%


### Support Vector Machine

In [289]:
svm = lambda: SVC(kernel="linear", random_state=42)
klasificiraj(train_test_data, svm)

Training Accuracy: 98.8%
Testing Accuracy: 97.2%


### Random Forest

In [290]:
rnd_forest = lambda: RandomForestClassifier(random_state=42)
klasificiraj(train_test_data, rnd_forest)

Training Accuracy: 100.0%
Testing Accuracy: 96.5%


### Naive Bayes

In [291]:
bayes = lambda: GaussianNB()
klasificiraj(train_test_data, bayes)

Training Accuracy: 93.7%
Testing Accuracy: 95.1%


### Decision Tree

In [292]:
dec_tree = lambda: DecisionTreeClassifier(random_state=42)
klasificiraj(train_test_data, dec_tree)

Training Accuracy: 100.0%
Testing Accuracy: 95.1%


### k-Nearest Neighbors

In [293]:
knn = lambda: KNeighborsClassifier(n_neighbors=3)
klasificiraj(train_test_data, knn)

Training Accuracy: 98.6%
Testing Accuracy: 96.5%


## Recursive Feature Elimination (Logistic Regression as the Model, 15 Features)

We will also try to improve the results with recursive feature elimination (RFE), a semi-automated feature engineering technique. RFE repeatedly fits a model and discards the least important features until the requested number remains. Because the method needs a model to rank the features by, we start with logistic regression.

In [294]:
from sklearn.feature_selection import RFE

X = df.loc[:, df.columns != 'Diagnosis']
y = df.loc[:, 'Diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

num_features_to_select = 15
model_for_rfe = LogisticRegression(max_iter=2500) 
rfe = RFE(estimator=model_for_rfe, n_features_to_select=num_features_to_select)

rfe.fit(X_train, y_train)

selected_features = X.columns[rfe.support_]
print("Odabrane značajke:", selected_features)
X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

train_test_data = X_train_selected, X_test_selected, y_train, y_test

Odabrane značajke: Index(['Mean Radius', 'Mean Smoothness', 'Mean Compactness', 'Mean Concavity',
       'Mean Concave Points', 'Mean Symmetry', 'Perimeter SE', 'Worst Radius',
       'Worst Texture', 'Worst Perimeter', 'Worst Smoothness',
       'Worst Compactness', 'Worst Concavity', 'Worst Concave Points',
       'Worst Symmetry'],
      dtype='object')
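Besides the boolean mask in `support_`, a fitted `RFE` object also exposes `ranking_`, which records the order in which features were eliminated. The following sketch shows both attributes on synthetic data; the data set and the choice of 4 features are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

rfe_demo = RFE(estimator=LogisticRegression(max_iter=2500), n_features_to_select=4)
rfe_demo.fit(X_demo, y_demo)

print("kept:", rfe_demo.support_)     # boolean mask of retained features
print("ranking:", rfe_demo.ranking_)  # 1 = retained; larger = eliminated earlier
```

Inspecting `ranking_` can help judge how close a dropped feature was to being kept when comparing the 15-feature and 8-feature runs.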


### Logistic Regression

In [295]:
log_reg = lambda: LogisticRegression(max_iter=3000, random_state=42)
klasificiraj(train_test_data, log_reg)

Training Accuracy: 95.3%
Testing Accuracy: 97.2%


### Support Vector Machine

In [296]:
svm = lambda: SVC(kernel="linear", random_state=42)
klasificiraj(train_test_data, svm)

Training Accuracy: 96.2%
Testing Accuracy: 97.2%


### Random Forest

In [297]:
rnd_forest = lambda: RandomForestClassifier(random_state=42)
klasificiraj(train_test_data, rnd_forest)

Training Accuracy: 100.0%
Testing Accuracy: 96.5%


### Naive Bayes

In [298]:
bayes = lambda: GaussianNB()
klasificiraj(train_test_data, bayes)

Training Accuracy: 93.2%
Testing Accuracy: 94.4%


### Decision Tree

In [299]:
dec_tree = lambda: DecisionTreeClassifier(random_state=42)
klasificiraj(train_test_data, dec_tree)

Training Accuracy: 100.0%
Testing Accuracy: 93.0%


### k-Nearest Neighbors

In [300]:
knn = lambda: KNeighborsClassifier(n_neighbors=3)
klasificiraj(train_test_data, knn)

Training Accuracy: 95.5%
Testing Accuracy: 95.8%


## Recursive Feature Elimination (SVM as the Model, 15 Features)

In [301]:
X = df.loc[:, df.columns != 'Diagnosis']
y = df.loc[:, 'Diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

num_features_to_select = 15
model_for_rfe = SVC(kernel='linear')
rfe = RFE(estimator=model_for_rfe, n_features_to_select=num_features_to_select)

rfe.fit(X_train, y_train)

selected_features = X.columns[rfe.support_]
print("Odabrane značajke:", selected_features)
X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

train_test_data = X_train_selected, X_test_selected, y_train, y_test

Odabrane značajke: Index(['Mean Radius', 'Mean Smoothness', 'Mean Compactness', 'Mean Concavity',
       'Mean Concave Points', 'Mean Symmetry', 'Radius SE', 'Worst Radius',
       'Worst Texture', 'Worst Smoothness', 'Worst Compactness',
       'Worst Concavity', 'Worst Concave Points', 'Worst Symmetry',
       'Worst Fractal dimension'],
      dtype='object')


### Logistic Regression

In [302]:
log_reg = lambda: LogisticRegression(max_iter=3000, random_state=42)
klasificiraj(train_test_data, log_reg)

Training Accuracy: 95.5%
Testing Accuracy: 97.2%


### Support Vector Machine

In [303]:
svm = lambda: SVC(kernel="linear", random_state=42)
klasificiraj(train_test_data, svm)

Training Accuracy: 96.2%
Testing Accuracy: 97.2%


### Random Forest

In [304]:
rnd_forest = lambda: RandomForestClassifier(random_state=42)
klasificiraj(train_test_data, rnd_forest)

Training Accuracy: 100.0%
Testing Accuracy: 96.5%


### Naive Bayes

In [305]:
bayes = lambda: GaussianNB()
klasificiraj(train_test_data, bayes)

Training Accuracy: 92.7%
Testing Accuracy: 94.4%


### Decision Tree

In [306]:
dec_tree = lambda: DecisionTreeClassifier(random_state=42)
klasificiraj(train_test_data, dec_tree)

Training Accuracy: 100.0%
Testing Accuracy: 93.0%


### k-Nearest Neighbors

In [307]:
knn = lambda: KNeighborsClassifier(n_neighbors=3)
klasificiraj(train_test_data, knn)

Training Accuracy: 95.3%
Testing Accuracy: 93.0%


## Recursive Feature Elimination (Logistic Regression as the Model, 8 Features)

In [308]:
X = df.loc[:, df.columns != 'Diagnosis']
y = df.loc[:, 'Diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)



num_features_to_select = 8
model_for_rfe = LogisticRegression(max_iter=2500) 
rfe = RFE(estimator=model_for_rfe, n_features_to_select=num_features_to_select)

rfe.fit(X_train, y_train)

selected_features = X.columns[rfe.support_]
print("Odabrane značajke:", selected_features)
X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

train_test_data = X_train_selected, X_test_selected, y_train, y_test

Odabrane značajke: Index(['Mean Radius', 'Mean Concavity', 'Perimeter SE', 'Worst Radius',
       'Worst Compactness', 'Worst Concavity', 'Worst Concave Points',
       'Worst Symmetry'],
      dtype='object')


### Logistic Regression

In [309]:
log_reg = lambda: LogisticRegression(max_iter=3000, random_state=42)
klasificiraj(train_test_data, log_reg)

Training Accuracy: 94.1%
Testing Accuracy: 97.9%


### Support Vector Machine

In [310]:
svm = lambda: SVC(kernel="linear", random_state=42)
klasificiraj(train_test_data, svm)

Training Accuracy: 94.6%
Testing Accuracy: 97.9%


### Random Forest

In [311]:
rnd_forest = lambda: RandomForestClassifier(random_state=42)
klasificiraj(train_test_data, rnd_forest)

Training Accuracy: 100.0%
Testing Accuracy: 95.1%


### Naive Bayes

In [312]:
bayes = lambda: GaussianNB()
klasificiraj(train_test_data, bayes)

Training Accuracy: 93.0%
Testing Accuracy: 94.4%


### Decision Tree

In [313]:
dec_tree = lambda: DecisionTreeClassifier(random_state=42)
klasificiraj(train_test_data, dec_tree)

Training Accuracy: 100.0%
Testing Accuracy: 93.0%


### k-Nearest Neighbors

In [314]:
knn = lambda: KNeighborsClassifier(n_neighbors=3)
klasificiraj(train_test_data, knn)

Training Accuracy: 96.7%
Testing Accuracy: 95.8%


## Principal Component Analysis (PCA)

In [315]:
from sklearn.decomposition import PCA

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data.drop('Diagnosis', axis=1))

pca = PCA(n_components=15)
pca_result = pca.fit_transform(scaled_data)

pca_df = pd.DataFrame(data=pca_result, columns=[f'Principal Component {i+1}' for i in range(15)])
pca_df['Diagnosis'] = data['Diagnosis']

X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(pca_df.drop('Diagnosis', axis=1), pca_df['Diagnosis'], test_size=0.2, random_state=42)

train_test_data = X_train_pca, X_test_pca, y_train_pca, y_test_pca
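The cell above fixes the number of components at 15. One way to check such a choice is to look at the cumulative explained variance, available through `explained_variance_ratio_` on a fitted `PCA`. The sketch below does this on synthetic data, which is an assumption standing in for the notebook's `data`; the 95% threshold is likewise only an illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 30 features (assumption).
X_demo, _ = make_classification(
    n_samples=300, n_features=30, n_informative=10, random_state=42
)
scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=15).fit(scaled)
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"Variance retained by 15 components: {cumulative[-1]:.1%}")

# Smallest number of components reaching the threshold, if 15 are enough.
reached = np.flatnonzero(cumulative >= 0.95)
print("Components for 95%:", reached[0] + 1 if reached.size else "more than 15")
```

If the curve flattens well before 15 components, a smaller number may give a similar result with less noise for the downstream classifiers.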

### Logistic Regression

In [316]:
log_reg = lambda: LogisticRegression(max_iter=3000, random_state=42)
klasificiraj(train_test_data, log_reg)

Training Accuracy: 98.9%
Testing Accuracy: 97.4%


### Support Vector Machine

In [317]:
svm = lambda: SVC(kernel="linear", random_state=42)
klasificiraj(train_test_data, svm)

Training Accuracy: 98.5%
Testing Accuracy: 98.2%


### Random Forest

In [318]:
rnd_forest = lambda: RandomForestClassifier(random_state=42)
klasificiraj(train_test_data, rnd_forest)

Training Accuracy: 100.0%
Testing Accuracy: 94.7%


### Naive Bayes

In [319]:
bayes = lambda: GaussianNB()
klasificiraj(train_test_data, bayes)

Training Accuracy: 89.7%
Testing Accuracy: 90.4%


### Decision Tree

In [320]:
dec_tree = lambda: DecisionTreeClassifier(random_state=42)
klasificiraj(train_test_data, dec_tree)

Training Accuracy: 100.0%
Testing Accuracy: 92.1%


### k-Nearest Neighbors

In [321]:
knn = lambda: KNeighborsClassifier(n_neighbors=3)
klasificiraj(train_test_data, knn)

Training Accuracy: 98.7%
Testing Accuracy: 95.6%


## Recursive Feature Elimination (Logistic Regression as the Model, 15 Features) with Min-Max Normalization

In [322]:
df = data.copy()
columns_to_normalize = df.columns.difference(['Diagnosis'])
df_normalized = df.copy()
df_normalized[columns_to_normalize] = (df[columns_to_normalize] - df[columns_to_normalize].min()) / (df[columns_to_normalize].max() - df[columns_to_normalize].min())

X = df_normalized.loc[:, df_normalized.columns != 'Diagnosis']
y = df_normalized.loc[:, 'Diagnosis']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

num_features_to_select = 15
model_for_rfe = LogisticRegression(max_iter=2500) 
rfe = RFE(estimator=model_for_rfe, n_features_to_select=num_features_to_select)

rfe.fit(X_train, y_train)

selected_features = X.columns[rfe.support_]
print("Odabrane značajke:", selected_features)
X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

train_test_data = X_train_selected, X_test_selected, y_train, y_test

Odabrane značajke: Index(['Mean Radius', 'Mean Texture', 'Mean Perimeter', 'Mean Area',
       'Mean Concavity', 'Mean Concave Points', 'Radius SE', 'Worst Radius',
       'Worst Texture', 'Worst Perimeter', 'Worst Area', 'Worst Smoothness',
       'Worst Concavity', 'Worst Concave Points', 'Worst Symmetry'],
      dtype='object')


### Logistic Regression

In [323]:
log_reg = lambda: LogisticRegression(max_iter=3000, random_state=42)
klasificiraj(train_test_data, log_reg)

Training Accuracy: 96.2%
Testing Accuracy: 97.9%


### Support Vector Machine

In [324]:
svm = lambda: SVC(kernel="linear", random_state=42)
klasificiraj(train_test_data, svm)

Training Accuracy: 97.7%
Testing Accuracy: 97.9%


### Random Forest

In [325]:
rnd_forest = lambda: RandomForestClassifier(random_state=42)
klasificiraj(train_test_data, rnd_forest)

Training Accuracy: 100.0%
Testing Accuracy: 96.5%


### Naive Bayes

In [326]:
bayes = lambda: GaussianNB()
klasificiraj(train_test_data, bayes)

Training Accuracy: 95.3%
Testing Accuracy: 95.8%


### Decision Tree

In [327]:
dec_tree = lambda: DecisionTreeClassifier(random_state=42)
klasificiraj(train_test_data, dec_tree)

Training Accuracy: 100.0%
Testing Accuracy: 93.7%


### k-Nearest Neighbors

In [328]:
knn = lambda: KNeighborsClassifier(n_neighbors=3)
klasificiraj(train_test_data, knn)

Training Accuracy: 97.7%
Testing Accuracy: 96.5%


# Discussion of the Results

Recall the article's results for each of the algorithms used.

In [329]:
rezultati_clanka = {
    'Algoritam': ['Logistic Regression', 'SVM', 'Random Forest', 'Naive Bayes', 'Decision Tree', 'KNN'],
    'Train_Accuracy': ['99.1%', '98.8%', '99.5%', '95.1%', '100.0%', '97.6%'],
    'Test_Accuracy': ['94.4%', '96.5%', '96.5%', '92.3%', '95.1%', '95.8%']
}
print(pd.DataFrame(rezultati_clanka))

             Algoritam Train_Accuracy Test_Accuracy
0  Logistic Regression          99.1%         94.4%
1                  SVM          98.8%         96.5%
2        Random Forest          99.5%         96.5%
3          Naive Bayes          95.1%         92.3%
4        Decision Tree         100.0%         95.1%
5                  KNN          97.6%         95.8%


With min-max normalization, the test accuracy is greater than or equal to the article's for every algorithm.

In [330]:
minmax = {
    'Algoritam': ['Logistic Regression', 'SVM', 'Random Forest', 'Naive Bayes', 'Decision Tree', 'KNN'],
    'Train_Accuracy': ['96.7%', '97.4%', '100.0%', '93.7%', '100.0%', '98.1%'],
    'Train_Članak': ['99.1%', '98.8%', '99.5%', '95.1%', '100.0%', '97.6%'],
    'Test_Accuracy': ['98.6%', '98.6%', '96.5%', '95.1%', '95.1%', '97.2%'],
    'Test_Članak': ['94.4%', '96.5%', '96.5%', '92.3%', '95.1%', '95.8%']
}
print(pd.DataFrame(minmax))

             Algoritam Train_Accuracy Train_Članak Test_Accuracy Test_Članak
0  Logistic Regression          96.7%        99.1%         98.6%       94.4%
1                  SVM          97.4%        98.8%         98.6%       96.5%
2        Random Forest         100.0%        99.5%         96.5%       96.5%
3          Naive Bayes          93.7%        95.1%         95.1%       92.3%
4        Decision Tree         100.0%       100.0%         95.1%       95.1%
5                  KNN          98.1%        97.6%         97.2%       95.8%
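Because the tables store accuracies as strings, the gap to the article has to be read off by eye. A small sketch with numeric columns makes the comparison explicit. The values are copied from the min-max table above, and the column name `Diff_pp` (difference in percentage points) is a hypothetical addition.

```python
import pandas as pd

comparison = pd.DataFrame({
    'Algoritam': ['Logistic Regression', 'SVM', 'Random Forest',
                  'Naive Bayes', 'Decision Tree', 'KNN'],
    'Test_Accuracy': [98.6, 98.6, 96.5, 95.1, 95.1, 97.2],  # min-max results
    'Test_Članak': [94.4, 96.5, 96.5, 92.3, 95.1, 95.8],    # article results
})

# Difference in percentage points; positive means better than the article.
comparison['Diff_pp'] = (
    comparison['Test_Accuracy'] - comparison['Test_Članak']
).round(1)
print(comparison)
```

The same pattern could be applied to the other result tables to verify statements such as "greater than or equal to the article's" programmatically.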


With z-score normalization, i.e. standardization of the data, the test accuracy is again greater than or equal to the article's for every algorithm. For training accuracy, standardization does at least as well as min-max normalization, with higher values for logistic regression, SVM and KNN.

In [331]:
standardizacija = {
    'Algoritam': ['Logistic Regression', 'SVM', 'Random Forest', 'Naive Bayes', 'Decision Tree', 'KNN'],
    'Train_Accuracy': ['98.8%', '98.8%', '100.0%', '93.7%', '100.0%', '98.6%'],
    'Train_Članak': ['99.1%', '98.8%', '99.5%', '95.1%', '100.0%', '97.6%'],
    'Test_Accuracy': ['97.9%', '97.2%', '96.5%', '95.1%', '95.1%', '96.5%'],
    'Test_Članak': ['94.4%', '96.5%', '96.5%', '92.3%', '95.1%', '95.8%']
}
print(pd.DataFrame(standardizacija))

             Algoritam Train_Accuracy Train_Članak Test_Accuracy Test_Članak
0  Logistic Regression          98.8%        99.1%         97.9%       94.4%
1                  SVM          98.8%        98.8%         97.2%       96.5%
2        Random Forest         100.0%        99.5%         96.5%       96.5%
3          Naive Bayes          93.7%        95.1%         95.1%       92.3%
4        Decision Tree         100.0%       100.0%         95.1%       95.1%
5                  KNN          98.6%        97.6%         96.5%       95.8%


Recursive feature elimination with logistic regression as the model and 15 selected features gives a test accuracy greater than or equal to the article's, with the exception of the decision tree, where our result is lower (93.0% versus the article's 95.1%).

In [332]:
rfe_log_reg_15 = {
    'Algoritam': ['Logistic Regression', 'SVM', 'Random Forest', 'Naive Bayes', 'Decision Tree', 'KNN'],
    'Train_Accuracy': ['95.3%', '96.2%', '100.0%', '93.2%', '100.0%', '95.5%'],
    'Train_Članak': ['99.1%', '98.8%', '99.5%', '95.1%', '100.0%', '97.6%'],
    'Test_Accuracy': ['97.2%', '97.2%', '96.5%', '94.4%', '93.0%', '95.8%'],
    'Test_Članak': ['94.4%', '96.5%', '96.5%', '92.3%', '95.1%', '95.8%']
}
print(pd.DataFrame(rfe_log_reg_15))

             Algoritam Train_Accuracy Train_Članak Test_Accuracy Test_Članak
0  Logistic Regression          95.3%        99.1%         97.2%       94.4%
1                  SVM          96.2%        98.8%         97.2%       96.5%
2        Random Forest         100.0%        99.5%         96.5%       96.5%
3          Naive Bayes          93.2%        95.1%         94.4%       92.3%
4        Decision Tree         100.0%       100.0%         93.0%       95.1%
5                  KNN          95.5%        97.6%         95.8%       95.8%


Recursive feature elimination with SVM as the model and 15 selected features gives an equal or higher test accuracy for LR, SVM, RF and Naive Bayes, while KNN and the decision tree come out 2 to 3 percentage points lower.

In [333]:
rfe_svm_15 = {
    'Algoritam': ['Logistic Regression', 'SVM', 'Random Forest', 'Naive Bayes', 'Decision Tree', 'KNN'],
    'Train_Accuracy': ['95.5%', '96.2%', '100.0%', '92.7%', '100.0%', '95.3%'],
    'Train_Članak': ['99.1%', '98.8%', '99.5%', '95.1%', '100.0%', '97.6%'],
    'Test_Accuracy': ['97.2%', '97.2%', '96.5%', '94.4%', '93.0%', '93.0%'],
    'Test_Članak': ['94.4%', '96.5%', '96.5%', '92.3%', '95.1%', '95.8%']
}
print(pd.DataFrame(rfe_svm_15))

             Algoritam Train_Accuracy Train_Članak Test_Accuracy Test_Članak
0  Logistic Regression          95.5%        99.1%         97.2%       94.4%
1                  SVM          96.2%        98.8%         97.2%       96.5%
2        Random Forest         100.0%        99.5%         96.5%       96.5%
3          Naive Bayes          92.7%        95.1%         94.4%       92.3%
4        Decision Tree         100.0%       100.0%         93.0%       95.1%
5                  KNN          95.3%        97.6%         93.0%       95.8%


Recursive feature elimination with logistic regression as the model and 8 selected features gives a higher test accuracy than the article for LR, SVM and Naive Bayes and an equal one for KNN, while the result is lower for Random Forest and the decision tree. Compared with the 15-feature variant, the 8-feature variant is slightly better for LR and SVM (97.9% versus 97.2%), identical for Naive Bayes, the decision tree and KNN, and worse only for Random Forest (95.1% versus 96.5%).

In [334]:
rfe_log_reg_8 = {
    'Algoritam': ['Logistic Regression', 'SVM', 'Random Forest', 'Naive Bayes', 'Decision Tree', 'KNN'],
    'Train_Accuracy': ['94.1%', '94.6%', '100.0%', '93.0%', '100.0%', '96.7%'],
    'Train_Članak': ['99.1%', '98.8%', '99.5%', '95.1%', '100.0%', '97.6%'],
    'Test_Accuracy': ['97.9%', '97.9%', '95.1%', '94.4%', '93.0%', '95.8%'],
    'Test_Članak': ['94.4%', '96.5%', '96.5%', '92.3%', '95.1%', '95.8%']
}
print(pd.DataFrame(rfe_log_reg_8))

             Algoritam Train_Accuracy Train_Članak Test_Accuracy Test_Članak
0  Logistic Regression          94.1%        99.1%         97.9%       94.4%
1                  SVM          94.6%        98.8%         97.9%       96.5%
2        Random Forest         100.0%        99.5%         95.1%       96.5%
3          Naive Bayes          93.0%        95.1%         94.4%       92.3%
4        Decision Tree         100.0%       100.0%         93.0%       95.1%
5                  KNN          96.7%        97.6%         95.8%       95.8%


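Principal component analysis gives a higher test accuracy than the article for LR and SVM, while for the other algorithms the result is lower. Note that this experiment used a 20% test split (`test_size=0.2`) instead of the 25% used elsewhere, so the numbers are not strictly comparable with the other methods.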

In [335]:
# Named pca_rezultati so the fitted PCA object `pca` is not overwritten
pca_rezultati = {
    'Algoritam': ['Logistic Regression', 'SVM', 'Random Forest', 'Naive Bayes', 'Decision Tree', 'KNN'],
    'Train_Accuracy': ['98.9%', '98.5%', '100.0%', '89.7%', '100.0%', '98.7%'],
    'Train_Članak': ['99.1%', '98.8%', '99.5%', '95.1%', '100.0%', '97.6%'],
    'Test_Accuracy': ['97.4%', '98.2%', '94.7%', '90.4%', '92.1%', '95.6%'],
    'Test_Članak': ['94.4%', '96.5%', '96.5%', '92.3%', '95.1%', '95.8%']
}
print(pd.DataFrame(pca_rezultati))

             Algoritam Train_Accuracy Train_Članak Test_Accuracy Test_Članak
0  Logistic Regression          98.9%        99.1%         97.4%       94.4%
1                  SVM          98.5%        98.8%         98.2%       96.5%
2        Random Forest         100.0%        99.5%         94.7%       96.5%
3          Naive Bayes          89.7%        95.1%         90.4%       92.3%
4        Decision Tree         100.0%       100.0%         92.1%       95.1%
5                  KNN          98.7%        97.6%         95.6%       95.8%


Recursive feature elimination with logistic regression as the model and 15 selected features, combined with min-max normalization, gives a test accuracy greater than or equal to the article's for every algorithm except the decision tree, where our test accuracy is lower (93.7% versus 95.1%).

In [336]:
rfe_log_reg_15_minmax = {
    'Algoritam': ['Logistic Regression', 'SVM', 'Random Forest', 'Naive Bayes', 'Decision Tree', 'KNN'],
    'Train_Accuracy': ['96.2%', '97.7%', '100.0%', '95.3%', '100.0%', '97.7%'],
    'Train_Članak': ['99.1%', '98.8%', '99.5%', '95.1%', '100.0%', '97.6%'],
    'Test_Accuracy': ['97.9%', '97.9%', '96.5%', '95.8%', '93.7%', '96.5%'],
    'Test_Članak': ['94.4%', '96.5%', '96.5%', '92.3%', '95.1%', '95.8%']
}
print(pd.DataFrame(rfe_log_reg_15_minmax))

             Algoritam Train_Accuracy Train_Članak Test_Accuracy Test_Članak
0  Logistic Regression          96.2%        99.1%         97.9%       94.4%
1                  SVM          97.7%        98.8%         97.9%       96.5%
2        Random Forest         100.0%        99.5%         96.5%       96.5%
3          Naive Bayes          95.3%        95.1%         95.8%       92.3%
4        Decision Tree         100.0%       100.0%         93.7%       95.1%
5                  KNN          97.7%        97.6%         96.5%       95.8%


# Conclusion

The highest training accuracy for logistic regression, SVM and KNN was obtained with z-score normalization and with principal component analysis (PCA). Random Forest and the decision tree reach 100.0% training accuracy with every method.

Min-max normalization gives the best (or tied-best) test accuracy for every algorithm except Naive Bayes, which did slightly better with RFE using logistic regression combined with min-max (plain min-max: 95.1%, RFE + min-max: 95.8%).

The highest test accuracy for Random Forest is 96.5%, and it is reached with every method tried except PCA and RFE with logistic regression and 8 features.

The highest test accuracy for the decision tree is 95.1%, and it is reached with min-max normalization and with z-score normalization.

Overall, the best combination proved to be min-max normalization of the features with the SVM algorithm (98.6% test accuracy, tied with logistic regression, and a higher training accuracy of 97.4% versus 96.7%).
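All figures above come from a single train/test split, so differences of a percentage point or less may depend on the particular split. A hedged way to check the winning combination is k-fold cross-validation with the scaler inside a pipeline, so that each fold is scaled using its own training part. The sketch below assumes scikit-learn's bundled breast-cancer data set as a stand-in, since the source does not state where `data` was loaded from; the choice of 5 folds is also an assumption.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in data set (assumption); replace with the notebook's own features and labels.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scaling happens inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(MinMaxScaler(), SVC(kernel="linear", random_state=42))
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)

print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough sense of how much of the gap between methods could be due to chance.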