In [16]:
import pandas as pd
import joblib

In [17]:
!wget https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/refs/heads/main/employee/employee_data.csv
# 2. Load the downloaded CSV file
df = pd.read_csv('employee_data.csv')

# 3. Check the first 5 rows
df.head()

--2025-05-28 14:56:06--  https://raw.githubusercontent.com/dicodingacademy/dicoding_dataset/refs/heads/main/employee/employee_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 226188 (221K) [text/plain]
Saving to: ‘employee_data.csv.3’


2025-05-28 14:56:06 (9.27 MB/s) - ‘employee_data.csv.3’ saved [226188/226188]



Unnamed: 0,EmployeeId,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,1,38,,Travel_Frequently,1444,Human Resources,1,4,Other,1,...,2,80,1,7,2,3,6,2,1,2
1,2,37,1.0,Travel_Rarely,1141,Research & Development,11,2,Medical,1,...,1,80,0,15,2,1,1,0,0,0
2,3,51,1.0,Travel_Rarely,1323,Research & Development,4,4,Life Sciences,1,...,3,80,3,18,2,4,10,0,2,7
3,4,42,0.0,Travel_Frequently,555,Sales,26,3,Marketing,1,...,4,80,1,23,2,4,20,4,4,8
4,5,40,,Travel_Rarely,1194,Research & Development,2,4,Medical,1,...,2,80,3,20,2,3,5,3,0,2


In [18]:
df_cleaned = df.copy()
df_unlabeled = df_cleaned[df_cleaned['Attrition'].isnull()]

In [19]:

# Load the previously trained Random Forest model
rf_model = joblib.load('rf_model.pkl')


# Make sure the 'Attrition' column is not present at prediction time
if 'Attrition' in df_unlabeled.columns:
    df_unlabeled = df_unlabeled.drop('Attrition', axis=1)

# Manual preprocessing (must match the steps used at training time)
df_unlabeled = df_unlabeled.drop(['EmployeeId', 'Over18', 'StandardHours'], axis=1)
df_unlabeled['Gender'] = df_unlabeled['Gender'].map({'Male': 0, 'Female': 1})
df_unlabeled['OverTime'] = df_unlabeled['OverTime'].map({'No': 0, 'Yes': 1})
df_unlabeled = pd.get_dummies(
    df_unlabeled,
    columns=['BusinessTravel', 'Department', 'EducationField', 'JobRole', 'MaritalStatus'],
    drop_first=True
)

# Align the columns with the training data
# (this requires saving X_rf_train.columns to a file during training so it can be loaded here)
X_rf_train_columns = joblib.load('X_rf_train_columns.pkl')  # Must be saved when the model is trained

missing_cols = set(X_rf_train_columns) - set(df_unlabeled.columns)
for c in missing_cols:
    df_unlabeled[c] = 0

df_unlabeled = df_unlabeled[X_rf_train_columns]

# Predict the class-1 probabilities and the initial class labels
probs = rf_model.predict_proba(df_unlabeled)[:, 1]
initial_preds = rf_model.predict(df_unlabeled)

# Custom rule: flip the prediction when the class-1 probability is below 0.5
final_preds = []
for pred, prob in zip(initial_preds, probs):
    if prob < 0.5:
        final_preds.append(1 - pred)
    else:
        final_preds.append(pred)

# Build a DataFrame with the prediction results
result_df = pd.DataFrame({
    'prob_actual': probs,
    'prediksi_awal': initial_preds,
    'prediksi_final': final_preds
})

# Display the results
print(result_df)

     prob_actual  prediksi_awal  prediksi_final
0           0.33            0.0             1.0
1           0.09            0.0             1.0
2           0.20            0.0             1.0
3           0.12            0.0             1.0
4           0.17            0.0             1.0
..           ...            ...             ...
407         0.50            0.0             0.0
408         0.08            0.0             1.0
409         0.25            0.0             1.0
410         0.23            0.0             1.0
411         0.03            0.0             1.0

[412 rows x 3 columns]
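
The column-alignment step in the cell above (adding missing dummy columns as 0, then reordering to the training layout) can also be written with `DataFrame.reindex`. The sketch below is a minimal illustration, not the notebook's actual data: `train_columns` and `new_data` are hypothetical stand-ins for `X_rf_train_columns` and `df_unlabeled`.

```python
import pandas as pd

# Hypothetical stand-in for the column list saved at training time.
train_columns = ['Age', 'OverTime', 'Department_Sales',
                 'Department_Research & Development']

# Hypothetical new data in which only one Department category occurs.
new_data = pd.DataFrame({
    'Age': [38, 40],
    'OverTime': [1, 0],
    'Department': ['Sales', 'Sales'],
})

# Without drop_first, every category present in the data gets its own column.
encoded = pd.get_dummies(new_data, columns=['Department'])

# reindex adds training columns that are missing (filled with 0), drops any
# column the model never saw, and puts everything in the training order.
aligned = encoded.reindex(columns=train_columns, fill_value=0)

print(aligned.columns.tolist())
```

One reason to consider this variant: with `drop_first=True`, the column that gets dropped depends on which categories happen to be present in the new data, so it may differ from the one dropped during training. Encoding every category and then reindexing against the saved training columns avoids that dependency.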


1. **Manual Preprocessing**  
   The `df_unlabeled` dataset, which was separated earlier, is first preprocessed manually so that its features are ready to be passed to the model.

2. **Applying the Random Forest Model**  
   Once preprocessing is done, the dataset is passed to the previously saved Random Forest model to generate predictions.

3. **Prediction Output**  
   The model produces two main outputs:  
   - **Probability** (`prob_actual`): the predicted probability of class 1, taken from `predict_proba(...)[:, 1]`.  
   - **Initial prediction** (`prediksi_awal`): the class predicted by the model, that is, the class with the highest probability.

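4. **Applying a Probability-Based Rule to the Predictions**  
   - If `prob_actual` is **below 0.5**, the model is treated as not confident in its prediction, so the prediction is **flipped** (changed to the opposite class).  
   - If `prob_actual` is **0.5 or higher**, the prediction is **kept** unchanged.

The intent of this rule is to make the predictions on `df_unlabeled` more reliable by taking the model's confidence into account. Note, however, that `prob_actual` is the probability of class 1, not the probability of whichever class was predicted. In a two-class model, an initial prediction of 0 always has a class-1 probability of at most 0.5, so the rule as coded turns every such prediction into 1 unless the probability is exactly 0.5. This matches the output above: every displayed row with `prob_actual` below 0.5 ends with `prediksi_final` equal to 1. A rule that reflects confidence would need to use the probability of the predicted class instead.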
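
To make the effect of the rule concrete, the sketch below applies it to three class-1 probabilities taken from the output above (0.33, 0.09, 0.50) plus one hypothetical value (0.80), and compares it with a check based on the probability of the predicted class. The helper names `flip_rule` and `predicted_class_confidence` and the 0.6 review threshold are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np

# Class-1 probabilities: the first three appear in the output above,
# 0.80 is a hypothetical value added for illustration.
probs = np.array([0.33, 0.09, 0.50, 0.80])

# Initial predictions as a two-class model would give them
# (class 1 only when its probability exceeds 0.5; a tie goes to class 0).
initial_preds = (probs > 0.5).astype(int)


def flip_rule(preds, class1_probs):
    """Rule as coded in the notebook: flip when the class-1 probability is below 0.5."""
    return np.where(class1_probs < 0.5, 1 - preds, preds)


def predicted_class_confidence(class1_probs):
    """Probability of whichever class was predicted (hypothetical helper)."""
    return np.maximum(class1_probs, 1 - class1_probs)


final_preds = flip_rule(initial_preds, probs)
confidence = predicted_class_confidence(probs)

# Assumed alternative: keep the prediction, but mark rows whose confidence
# is below an illustrative threshold so they can be reviewed manually.
needs_review = confidence < 0.6

print(final_preds)
print(confidence)
print(needs_review)
```

In a two-class problem the confidence of the predicted class can never fall below 0.5, so a flip rule based on it would never trigger. Flagging low-confidence rows for review, as sketched here, is one possible alternative rather than a recommendation from the original analysis.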