### Practicum Assignments

**TASK 1** <br>
Using the mushroom dataset, compare the performance of the Decision Tree and Random Forest algorithms. Use hyperparameter tuning to find the best parameters and the best accuracy.

In [1]:
# 1. Import the required libraries
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier  # Decision Tree
from sklearn.ensemble import RandomForestClassifier  # Random Forest
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [4]:
# 2. Load the dataset
df = pd.read_csv('data/mushrooms.csv')
df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [5]:
# 3. Select the features
X = df.drop('class', axis=1)  # Use every column except 'class' as a feature
y = df['class']  # The 'class' column is the target
y = y.map({'p': 1, 'e': 0})  # Encode the labels: poisonous = 1, edible = 0

# One-hot encode the categorical features
X_encoded = pd.get_dummies(X, drop_first=True)
# Check the number of instances and raw (pre-encoding) features
X.shape

(8124, 22)
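
The shape above is that of `X` before encoding. As a minimal sketch of what `pd.get_dummies(..., drop_first=True)` does to the column count, here is a toy frame with made-up values (the rows are illustrative and not taken from the mushroom data):

```python
import pandas as pd

# Toy categorical frame (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "cap-shape": ["x", "b", "x"],
    "odor": ["p", "a", "l"],
})

# drop_first=True drops the first (alphabetically sorted) category of each
# column, so a column with k categories becomes k - 1 indicator columns
toy_encoded = pd.get_dummies(toy, drop_first=True)

print(toy.shape)                  # shape before encoding
print(toy_encoded.shape)          # shape after encoding
print(list(toy_encoded.columns))  # names of the indicator columns
```

Calling `X_encoded.shape` in the notebook would show the post-encoding feature count in the same way.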

In [6]:
# 4. Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

In [7]:
# 5. Decision Tree
dt = DecisionTreeClassifier()
dt_param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
dt_grid_search = GridSearchCV(estimator=dt, param_grid=dt_param_grid, cv=5, n_jobs=-1, verbose=2)
# Tune on the training split only so the test set stays unseen during the search
dt_grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


In [8]:
# Check the Decision Tree accuracy on the test set
dt_best_model = dt_grid_search.best_estimator_
dt_pred = dt_best_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)
print("Decision Tree accuracy:", dt_accuracy)

Decision Tree accuracy: 1.0


In [9]:
# 6. Random Forest
rf = RandomForestClassifier()
rf_param_dist = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'max_features': ['sqrt', 'log2'],  # 'auto' (an alias for 'sqrt') was removed in scikit-learn 1.3
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}
rf_random_search = RandomizedSearchCV(estimator=rf, param_distributions=rf_param_dist, n_iter=10, cv=5, n_jobs=-1, verbose=2)
# Tune on the training split only so the test set stays unseen during the search
rf_random_search.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [10]:
# Check the Random Forest accuracy on the test set
rf_best_model = rf_random_search.best_estimator_
rf_pred = rf_best_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)
print("Random Forest accuracy:", rf_accuracy)

Random Forest accuracy: 1.0
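
The task also asks for the best parameters, which a fitted search object exposes through `best_params_` and `best_score_`. The following is a minimal sketch of reporting them side by side for both models. It assumes synthetic stand-in data from `make_classification` rather than the mushroom dataset, and the small `max_depth` grid and the names `X_demo`, `searches`, and `results` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# One small search per model; the grid is deliberately tiny for illustration
searches = {
    "Decision Tree": GridSearchCV(
        DecisionTreeClassifier(random_state=42),
        {"max_depth": [None, 5, 10]}, cv=5),
    "Random Forest": GridSearchCV(
        RandomForestClassifier(n_estimators=50, random_state=42),
        {"max_depth": [None, 5, 10]}, cv=5),
}

results = {}
for name, search in searches.items():
    search.fit(X_tr, y_tr)  # tune on the training split only
    results[name] = {
        "best_params": search.best_params_,         # winning hyperparameters
        "cv_accuracy": search.best_score_,          # mean cross-validated accuracy
        "test_accuracy": search.score(X_te, y_te),  # accuracy on the held-out split
    }
    print(name, results[name])
```

In the notebook itself, the same attributes are available on `dt_grid_search` and `rf_random_search` after fitting.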


**TASK 2** <br>
Using the mushroom dataset, compare the performance of the Decision Tree and AdaBoost algorithms. Use hyperparameter tuning to find the best parameters and the best accuracy.

In [12]:
# 1. Import the library
from sklearn.ensemble import AdaBoostClassifier

In [14]:
# 2. Measure test accuracy with AdaBoost
ab = AdaBoostClassifier(n_estimators=20)

ab.fit(X_train, y_train)

y_pred_ab = ab.predict(X_test)

acc_ab = accuracy_score(y_test, y_pred_ab)
print(f"Test set accuracy : {acc_ab}")

Test set accuracy : 1.0
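
The cell above fixes `n_estimators=20` without a search. One possible way to add the hyperparameter tuning the task asks for is sketched below. It assumes synthetic stand-in data from `make_classification` rather than the mushroom dataset, and the grid values and the names `ab_param_grid`, `ab_search`, and `dt_baseline` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the mushroom dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Illustrative grid over two common AdaBoost hyperparameters
ab_param_grid = {
    "n_estimators": [20, 50, 100],
    "learning_rate": [0.1, 0.5, 1.0],
}
ab_search = GridSearchCV(AdaBoostClassifier(random_state=0), ab_param_grid, cv=5)
ab_search.fit(X_tr, y_tr)  # tune on the training split only

# Untuned single decision tree as the comparison baseline
dt_baseline = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

ab_test_acc = accuracy_score(y_te, ab_search.best_estimator_.predict(X_te))
dt_test_acc = accuracy_score(y_te, dt_baseline.predict(X_te))

print("Best AdaBoost parameters:", ab_search.best_params_)
print(f"AdaBoost test accuracy: {ab_test_acc:.3f}")
print(f"Decision Tree test accuracy: {dt_test_acc:.3f}")
```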


**TASK 3** <br>
Using the diabetes dataset, build a voting ensemble from the following algorithms: <br>
1. Logistic Regression
2. SVM with a polynomial kernel
3. Decision Tree <br>

You may explore further by tuning the hyperparameters.

In [16]:
# 1. Import the required libraries
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

In [18]:
# 2. Load the dataset
df = pd.read_csv('data/diabetes.csv')
df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [19]:
# Check each column for null values
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [27]:
# Count zero-valued entries per feature; in this dataset a zero can stand in
# for a missing measurement (e.g. Glucose, BloodPressure, BMI)
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure',
                   'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
for column in feature_columns:
    print("============================================")
    print(f"{column} ==> Missing zeros : {len(df.loc[df[column] == 0])}")

Pregnancies ==> Missing zeros : 0
Glucose ==> Missing zeros : 0
BloodPressure ==> Missing zeros : 0
SkinThickness ==> Missing zeros : 0
Insulin ==> Missing zeros : 0
BMI ==> Missing zeros : 0
DiabetesPedigreeFunction ==> Missing zeros : 0
Age ==> Missing zeros : 0


In [28]:
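from sklearn.impute import SimpleImputer

# Treat zeros as missing values and replace them with the column mean.
# Caveat: this also overwrites legitimate zeros, e.g. in 'Pregnancies'.
fill_values = SimpleImputer(missing_values=0, strategy="mean", copy=False)

df[feature_columns] = fill_values.fit_transform(df[feature_columns])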
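
To make the effect of `SimpleImputer(missing_values=0, strategy="mean")` concrete, here is a minimal sketch on a toy matrix with made-up numbers (not taken from the diabetes data). Each zero is replaced by the mean of the non-zero entries in its column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix where 0 marks a missing measurement (made-up numbers)
toy = np.array([[1.0, 0.0],
                [3.0, 4.0],
                [0.0, 8.0]])

imputer = SimpleImputer(missing_values=0, strategy="mean")
filled = imputer.fit_transform(toy)

print(imputer.statistics_)  # per-column means over the non-zero entries: [2. 6.]
print(filled)               # zeros replaced: [[1. 6.] [3. 4.] [2. 8.]]
```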

In [29]:
X = df[feature_columns]
y = df.Outcome

# Split the dataset into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Fit logistic regression on the training split created in the previous cell
logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# Calculate accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(accuracy))

Accuracy of logistic regression classifier on test set: 0.78


In [31]:
# SVM with a polynomial kernel
clf = SVC(kernel="poly")
clf.fit(X_train, y_train)

ypred = clf.predict(X_test)

In [32]:
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, ypred)

print(score)

0.7792207792207793


In [41]:
# Decision trees with two split criteria: entropy and Gini impurity
dt_entropy = DecisionTreeClassifier(criterion='entropy')
dt_gini = DecisionTreeClassifier(criterion='gini')
dt_entropy.fit(X_train, y_train)
y_pred_entropy_train = dt_entropy.predict(X_train)
y_pred_entropy = dt_entropy.predict(X_test)
dt_gini.fit(X_train, y_train)
y_pred_gini_train = dt_gini.predict(X_train)
y_pred_gini = dt_gini.predict(X_test)
# Compare training and test accuracy to see how much each tree overfits
acc_entropy_train = accuracy_score(y_train, y_pred_entropy_train)
acc_entropy = accuracy_score(y_test, y_pred_entropy)
acc_gini_train = accuracy_score(y_train, y_pred_gini_train)
acc_gini = accuracy_score(y_test, y_pred_gini)
print(f'Entropy train accuracy: {acc_entropy_train}')
print(f'Entropy test accuracy: {acc_entropy}')
print('\n')
print(f'Gini train accuracy: {acc_gini_train}')
print(f'Gini test accuracy: {acc_gini}')

Entropy train accuracy: 1.0
Entropy test accuracy: 0.7077922077922078


Gini train accuracy: 1.0
Gini test accuracy: 0.7272727272727273
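
The three models are evaluated separately above, whereas the task asks for a voting ensemble. One possible way to combine them with `VotingClassifier` is sketched below. It assumes synthetic stand-in data from `make_classification` rather than the diabetes dataset, and the estimator labels (`logreg`, `svm_poly`, `dt`) and the choice of hard voting are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1)

voting = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=5000)),
        ("svm_poly", SVC(kernel="poly")),
        ("dt", DecisionTreeClassifier(random_state=1)),
    ],
    voting="hard",  # majority vote; soft voting would need SVC(probability=True)
)
voting.fit(X_tr, y_tr)

voting_acc = accuracy_score(y_te, voting.predict(X_te))
print(f"Voting ensemble accuracy: {voting_acc:.3f}")
```

In the notebook, the same ensemble could be fitted on the existing `X_train` and `y_train` and scored on `X_test`, which would make its accuracy directly comparable with the three individual results.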
