### **TUGAS 1**
Terdapat dataset mushroom. Berdasarkan dataset yang tersebut, bandingkan peforma antara algoritma Decision Tree dan RandomForest. Gunakan tunning hyperparameter untuk mendapatkan parameter dan akurasi yang terbaik.

In [1]:
# Import Library
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV

In [2]:
# Load Dataset
df = pd.read_csv('data/mushrooms.csv')
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [3]:
# data description
df.describe()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
count,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124,...,8124,8124,8124,8124,8124,8124,8124,8124,8124,8124
unique,2,6,4,10,2,9,2,2,2,12,...,4,9,9,1,4,3,5,9,6,7
top,e,x,y,n,f,n,f,c,b,b,...,s,w,w,p,w,o,p,w,v,d
freq,4208,3656,3244,2284,4748,3528,7914,6812,5612,1728,...,4936,4464,4384,8124,7924,7488,3968,2388,4040,3148


In [5]:
# data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [6]:
le = LabelEncoder()
df_encoded = df.apply(le.fit_transform)
X = df_encoded.drop('class', axis=1)
y = df_encoded['class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [7]:
param_grid_decision_tree = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

dt_classifier = DecisionTreeClassifier(random_state=42)
grid_search_dt = GridSearchCV(dt_classifier, param_grid_decision_tree, cv=5)
grid_search_dt.fit(X_train, y_train)

best_params = grid_search_dt.best_params_
best_score = grid_search_dt.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)

Best Parameters: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Best Score: 1.0


In [8]:
param_grid_random_forest = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

rf_classifier = RandomForestClassifier(random_state=42)
grid_search_rf = GridSearchCV(rf_classifier, param_grid_random_forest, cv=5)
grid_search_rf.fit(X_train, y_train)

best_params = grid_search_rf.best_params_
best_score = grid_search_rf.best_score_

print("Best Parameters:", best_params)
print("Best Score:", best_score)

Best Parameters: {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Best Score: 1.0


In [9]:
best_dt_model = grid_search_dt.best_estimator_
best_dt_params = grid_search_dt.best_params_
best_dt_model.fit(X_train, y_train)

best_rf_model = grid_search_rf.best_estimator_
best_rf_params = grid_search_rf.best_params_
best_rf_model.fit(X_train, y_train)

In [10]:
y_pred_dt = best_dt_model.predict(X_test)
y_pred_rf = best_rf_model.predict(X_test)

y_train_pred_dt = best_dt_model.predict(X_train)
y_train_pred_rf = best_rf_model.predict(X_train)

accuracy_dt = accuracy_score(y_test, y_pred_dt)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

accuracy_train_dt = accuracy_score(y_train, y_train_pred_dt)
accuracy_train_rf = accuracy_score(y_train, y_train_pred_rf)

In [11]:
print("Decision Tree:")
print("Best Parameters:", best_dt_params)
print("Testing Accuracy:", accuracy_dt)
print("Training Accuracy:", accuracy_train_dt)

print("\nRandom Forest:")
print("Best Parameters:", best_rf_params)
print("Testing Accuracy:", accuracy_rf)
print("Training Accuracy:", accuracy_train_rf)

Decision Tree:
Best Parameters: {'criterion': 'gini', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
Testing Accuracy: 1.0
Training Accuracy: 1.0

Random Forest:
Best Parameters: {'bootstrap': True, 'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Testing Accuracy: 1.0
Training Accuracy: 1.0


### **TUGAS 2**
Terdapat dataset mushroom. Berdasarkan dataset tersebut, bandingkan peforma antara algoritma Decision Tree dan AdaBoost. Gunakan tunning hyperparameter untuk mendapatkan parameter dan akurasi yang terbaik.

In [12]:
param_grid_adaboost = {
    'n_estimators': [50, 100, 200],      # Kombinasi nilai yang ingin diuji untuk parameter 'n_estimators'
    'learning_rate': [0.1, 0.5, 1.0]     # Kombinasi nilai yang ingin diuji untuk parameter 'learning_rate'
}

# Membuat objek DecisionTreeClassifier sebagai estimator dasar
base_dt = DecisionTreeClassifier(random_state=42)   
# Membuat objek AdaBoostClassifier dengan menggunakan base_dt sebagai estimator dasar
adaboost_classifier = AdaBoostClassifier(n_estimators=base_dt, random_state=42)   
# Membuat objek GridSearchCV dengan menggunakan adaboost_classifier dan param_grid_adaboost
grid_search_adaboost = GridSearchCV(adaboost_classifier, param_grid_adaboost, cv=5)   
# Melakukan grid search dengan fit pada objek GridSearchCV menggunakan data pelatihan (X_train dan y_train)
grid_search_adaboost.fit(X_train, y_train)  



In [13]:
# Menggunakan atribut best_estimator_ untuk mendapatkan model terbaik yang ditemukan oleh GridSearchCV untuk AdaBoost Classifier
best_adaboost_model = grid_search_adaboost.best_estimator_   
# Melatih model terbaik dengan memanggil metode fit pada model menggunakan data pelatihan (X_train dan y_train)
best_adaboost_model.fit(X_train, y_train)   

# Menggunakan model terbaik untuk melakukan prediksi pada data pengujian (X_test) dengan memanggil metode predict
y_pred_adaboost = best_adaboost_model.predict(X_test)   
# Menghitung akurasi model dengan membandingkan hasil prediksi (y_pred_adaboost) dengan label sebenarnya (y_test) menggunakan fungsi accuracy_score
accuracy_adaboost = accuracy_score(y_test, y_pred_adaboost)  



In [14]:
print("\nAdaBoost:")
print("Best Parameters:", grid_search_adaboost.best_params_)
print("Training Accuracy:", best_adaboost_model.score(X_train, y_train))
print("Testing Accuracy:", accuracy_adaboost)


AdaBoost:
Best Parameters: {'learning_rate': 0.5, 'n_estimators': 100}
Training Accuracy: 1.0
Testing Accuracy: 1.0


### **TUGAS 3**
Dengan menggunakan dataset diabetes, buatlah ensemble voting dengan algoritma

1. Logistic Regression

2. SVM kernel polynomial

3. Decission Tree

Anda boleh melakukan eksplorasi dengan melakukan tunning hyperparameter

In [15]:
# Load Dataset
df2 = pd.read_csv('data/diabetes.csv')
df2.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [16]:
# data description
df2.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [17]:
# Mengambil semua fitur dari DataFrame df2 kecuali kolom 'Outcome' dan menyimpannya dalam variabel X
X = df2.drop('Outcome', axis=1)    
# Mengambil kolom 'Outcome' dari DataFrame df2 dan menyimpannya dalam variabel y
y = df2['Outcome']    

# Memisahkan data menjadi data pelatihan dan data pengujian menggunakan train_test_split. X_train dan y_train akan berisi data pelatihan, sedangkan X_test dan y_test akan berisi data pengujian
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

In [18]:
# Membuat objek LogisticRegression dengan menggunakan max_iter=1000 sebagai parameter
logreg = LogisticRegression(max_iter=1000)    
# Melatih model Logistic Regression dengan memanggil metode fit pada objek logreg menggunakan data pelatihan (X_train dan y_train)
logreg.fit(X_train, y_train)    

# Menggunakan model Logistic Regression yang telah dilatih untuk melakukan prediksi pada data pengujian (X_test) dengan memanggil metode predict
y_pred_logreg = logreg.predict(X_test) 

In [19]:
print('Logistic Regression:')
print('Accuracy:', accuracy_score(y_test, y_pred_logreg))
print('Classification Report:\n', classification_report(y_test, y_pred_logreg))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_logreg))

Logistic Regression:
Accuracy: 0.7467532467532467
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.79      0.80        99
           1       0.64      0.67      0.65        55

    accuracy                           0.75       154
   macro avg       0.73      0.73      0.73       154
weighted avg       0.75      0.75      0.75       154

Confusion Matrix:
 [[78 21]
 [18 37]]


2. SVM with Polynomial Kernel

In [20]:
# Membuat objek StandardScaler yang akan digunakan untuk melakukan penskalaan fitur
scaler = StandardScaler()   

# Melakukan penskalaan fitur pada data pelatihan (X_train) dengan memanggil metode fit_transform pada objek scaler. Metode ini akan mempelajari parameter penskalaan dari data pelatihan dan kemudian melakukan penskalaan pada data pelatihan.
X_train_scaled = scaler.fit_transform(X_train)   
# Melakukan penskalaan fitur pada data pengujian (X_test) dengan memanggil metode transform pada objek scaler. Metode ini akan menggunakan parameter penskalaan yang telah dipelajari dari data pelatihan untuk melakukan penskalaan pada data pengujian.
X_test_scaled = scaler.transform(X_test)   

In [21]:
# Membuat objek SVM (Support Vector Machine) dengan kernel polinomial dan derajat polinomial 3
svm_poly = SVC(kernel='poly', degree=3)    
# Melatih model SVM dengan kernel polinomial menggunakan data pelatihan yang telah diubah skala (X_train_scaled) dan label pelatihan (y_train)
svm_poly.fit(X_train_scaled, y_train)    

# Menggunakan model SVM yang telah dilatih untuk melakukan prediksi pada data pengujian yang telah diubah skala (X_test_scaled) dengan memanggil metode predict
y_pred_svm_poly = svm_poly.predict(X_test_scaled)   

In [22]:
print('SVM with Polynomial Kernel:')
print('Accuracy:', accuracy_score(y_test, y_pred_svm_poly))
print('Classification Report:\n', classification_report(y_test, y_pred_svm_poly))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_svm_poly))

SVM with Polynomial Kernel:
Accuracy: 0.7467532467532467
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.90      0.82        99
           1       0.72      0.47      0.57        55

    accuracy                           0.75       154
   macro avg       0.74      0.69      0.70       154
weighted avg       0.74      0.75      0.73       154

Confusion Matrix:
 [[89 10]
 [29 26]]


3. Descision Tree

In [23]:
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

y_pred_dt = dt_classifier.predict(X_test)

In [24]:
print('Decision Tree:')
print('Accuracy:', accuracy_score(y_test, y_pred_dt))
print('Classification Report:\n', classification_report(y_test, y_pred_dt))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_dt))

Decision Tree:
Accuracy: 0.7467532467532467
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.76      0.79        99
           1       0.62      0.73      0.67        55

    accuracy                           0.75       154
   macro avg       0.73      0.74      0.73       154
weighted avg       0.76      0.75      0.75       154

Confusion Matrix:
 [[75 24]
 [15 40]]
