## **RADI605: Modern Machine Learning**
## [In progress]
### Assignment: Adaptive Boosting
**Romen Samuel Rodis Wabina** <br>
Student, PhD Data Science in Healthcare and Clinical Informatics <br>
Clinical Epidemiology and Biostatistics, Faculty of Medicine (Ramathibodi Hospital) <br>
Mahidol University

Note: In case of Python Markdown errors, you may access the assignment through this GitHub [Link](https://github.com/rrwabina/RADI605)

### <code> Question 1-2. Please select one dataset from [UCI](https://archive.ics.uci.edu/ml/index.php). Describe the data characteristics by using appropriate statistical techniques. </code> 

In [61]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from time import time
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
import random
import warnings
warnings.filterwarnings('ignore')
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

In [120]:
data = pd.read_csv('../data/sobar-72.csv', sep = ',', header = 0)

The dataset comprises of 72 observations and 19 characteristics (excluding the class label). The $y$ consists of two classes comprising of <code>Cervical Cancer</code> and <code>Healthy</code> classes, denoted as 1 and -1, respectively. This implies that the dependent variable $y$ is discrete and categorical in nature. 

### Data Preprocessing

In [116]:
X = data.iloc[:, 0:19].to_numpy()
y = data.iloc[:, 19].to_numpy()
y[y == 0] = -1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test  = scaler.transform(X_test)

### <code> Question 3. Create an adaptive boosting classifier with decision tree and SVM by using a python sklearn package. </code>

In [94]:
def init_params():
    tuned_parameters = [{   'kernel': ['rbf'],
                            'gamma': [1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-7], 
                            'C': [0.001, 0.10, 0.1, 10, 20, 25, 50, 100, 1000]},
                        {   'kernel': ['sigmoid'], 
                            'gamma': [1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-7], 
                            'C': [0.001, 0.10, 0.1, 10, 20, 25, 50, 100, 1000]},
                        {   'kernel': ['linear'], 
                            'C': [0.001, 0.10, 0.1, 10, 20, 25, 50, 100, 1000]}]

    scoring = {'Precision': 'precision', 
               'Recall': 'recall', 
               'Accuracy': 'accuracy', 
               'AUC': 'roc_auc', 'F1': 'f1_micro'}
    return tuned_parameters, scoring

def get_svmtuning(X_train, y_train, cv = 5):
    tuned_parameters, scoring = init_params()
    random.seed(413)
    for name, score in zip(scoring.keys(), scoring.values()):
        clf = GridSearchCV(SVC(C = 1), param_grid = tuned_parameters, cv = 5,
                            scoring = score, refit = 'Accuracy',
                            return_train_score = True)
        clf.fit(X_train, y_train)
        results = clf.cv_results_
        if name == 'F1':
            print(f'Best parameter set found on development set for {name}: \t {clf.best_params_}')
        return clf.best_params_

get_svmtuning(X_train, y_train, cv = 10)

{'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}

In [111]:
clf = SVC(kernel = 'sigmoid', C = 1000, gamma = 0.0001)
adaboost_svm = AdaBoostClassifier(clf, algorithm = 'SAMME', n_estimators = 1000, random_state = 1, learning_rate = 0.0001)

In [114]:
def train_adaboost(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    class_names = ['Cervical Cancer', 'Healthy']
    
    confusion = confusion_matrix(y_test, y_pred)
    
    print("Adaptive Boosting score: ", accuracy_score(y_test, y_pred))
    print('Confusion Matrix : \n', confusion)
    print(classification_report(y_test, y_pred, target_names = class_names))

train_adaboost(adaboost_svm)

Adaptive Boosting score:  0.5333333333333333
Confusion Matrix : 
 [[8 0]
 [7 0]]
                 precision    recall  f1-score   support

Cervical Cancer       0.53      1.00      0.70         8
        Healthy       0.00      0.00      0.00         7

       accuracy                           0.53        15
      macro avg       0.27      0.50      0.35        15
   weighted avg       0.28      0.53      0.37        15



In [113]:
adaboost_dct = AdaBoostClassifier(DecisionTreeClassifier(max_depth = 3), algorithm = 'SAMME', n_estimators = 50, random_state = 1, learning_rate = 0.001)