**Text Classification**

## Retrieving the text dataset

Not every certificate has been labeled, so filter the OCR output to keep only the texts that already have a label.

In [17]:
import pandas as pd
from pathlib import Path

def filter_data(file_all: str,
                file_id_filter: str,
                id_col: str = "id") -> set[str] | None:
    """
    Filter `file_all` so that it only keeps rows whose ID
    appears in `file_id_filter`. The result is written to
    `<stem of file_all>_filtered.csv` in the working directory.

    Parameters
    ----------
    file_all : str
        Name / path of the main CSV to be reduced.
    file_id_filter : str
        Name / path of the CSV that holds the reference ID column.
    id_col : str, default 'id'
        Name of the ID column in both files.

    Returns
    -------
    set[str] | None
        Set of reference IDs on success, otherwise None.
    """
    try:
        # --- read the reference ID file ------------------------------------------
        id_filter_df = pd.read_csv(file_id_filter)
        print(f"Membaca '{file_id_filter}' ...")

        target_ids: set[str] = set(id_filter_df[id_col].dropna().unique())
        print(f"Ditemukan {len(target_ids):,} ID unik di '{file_id_filter}'")

        # --- read the main file --------------------------------------------------
        all_df = pd.read_csv(file_all)
        print(f"Total baris di '{file_all}': {len(all_df):,}")

        # --- apply the filter ----------------------------------------------------
        filtered_df = all_df[all_df[id_col].isin(target_ids)]
        print(f"Baris setelah filtering: {len(filtered_df):,}")

        # --- save the result -----------------------------------------------------
        output_filename = f"{Path(file_all).stem}_filtered.csv"
        filtered_df.to_csv(output_filename, index=False)
        print(f"File hasil disimpan sebagai: '{output_filename}'")
        print("\n=== STATISTIK FILTERING ===")
        print(f"Total baris di {file_all}: {len(all_df):,}")
        print(f"Total ID unik di {file_id_filter}: {len(target_ids):,}")
        print(f"Baris yang cocok & dipertahankan: {len(filtered_df):,}")
        perc = len(filtered_df) / len(all_df) * 100 if len(all_df) else 0
        print(f"Persentase data yang dipertahankan: {perc:.2f}%")
        return target_ids

    except FileNotFoundError as e:
        print(f"❌ File tidak ditemukan: {e.filename}")
    except Exception as e:
        print(f"❌ Error tak terduga: {e}")

    return None


def verify_filter(file_all_filtered: str,
                  file_id_filter: str,
                  id_col: str = "id") -> None:
    """
    Check whether every ID in `file_id_filter`
    is actually present in `file_all_filtered`.
    """
    try:
        id_filter_df = pd.read_csv(file_id_filter)
        filtered_df = pd.read_csv(file_all_filtered)

        filter_ids = set(id_filter_df[id_col].dropna().unique())
        kept_ids    = set(filtered_df[id_col].dropna().unique())

        missing_ids = filter_ids - kept_ids
        extra_ids   = kept_ids  - filter_ids

        print("\n=== VERIFIKASI HASIL ===")
        print(f"ID di {file_id_filter}: {len(filter_ids):,}")
        print(f"ID di {file_all_filtered}: {len(kept_ids):,}")
        print(f"ID yang hilang: {len(missing_ids):,}")
        print(f"ID tambahan (tak ada di filter): {len(extra_ids):,}")

        if missing_ids:
            print("\nID yang hilang:")
            for i, mid in enumerate(sorted(missing_ids), 1):
                print(f"{i:3d}. {mid}")

        if not missing_ids and not extra_ids:
            print("✅ Filtering sempurna—semua ID cocok.")
        elif not missing_ids:
            print("✅ Semua ID utama ditemukan; ada ID ekstra.")
        else:
            print("⚠️  Ada ID yang belum ter‑match.")

    except FileNotFoundError as e:
        print(f"❌ File tidak ditemukan saat verifikasi: {e.filename}")
    except Exception as e:
        print(f"❌ Error tak terduga saat verifikasi: {e}")


FILE_ALL = "rawOcr.csv"
FILE_ID_FILTER = "label.csv"
ids = filter_data(FILE_ALL, FILE_ID_FILTER)
if ids is not None:
    verify_filter(f"{Path(FILE_ALL).stem}_filtered.csv", FILE_ID_FILTER)
    print("\n=== SELESAI ===")
else:
    print("Proses filtering gagal. Lihat pesan error di atas.")

Membaca 'label.csv' ...
Ditemukan 645 ID unik di 'label.csv'
Total baris di 'rawOcr.csv': 10,751
Baris setelah filtering: 626
File hasil disimpan sebagai: 'rawOcr_filtered.csv'

=== STATISTIK FILTERING ===
Total baris di rawOcr.csv: 10,751
Total ID unik di label.csv: 645
Baris yang cocok & dipertahankan: 626
Persentase data yang dipertahankan: 5.82%

=== VERIFIKASI HASIL ===
ID di label.csv: 645
ID di rawOcr_filtered.csv: 626
ID yang hilang: 19
ID tambahan (tak ada di filter): 0

ID yang hilang:
  1. 423002052-2-1
  2. 423003664-3-1
  3. 423008153-1-1
  4. 423008153-2-1
  5. 423010193-1-1
  6. 423010193-2-1
  7. 423101555-3-1
  8. 423101555-3-10
  9. 423101555-3-11
 10. 423101555-3-2
 11. 423101555-3-3
 12. 423101555-3-6
 13. 423102517-1-1
 14. 423103256-1-1
 15. 423103256-2-1
 16. 423103256-3-1
 17. 423105655-1-1
 18. 423106044-1-3
 19. 423106044-1-6
⚠️  Ada ID yang belum ter‑match.

=== SELESAI ===
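
The 19 missing IDs are labels that have no matching OCR row. As a cross-check, the same diagnosis can be sketched with a single outer merge and the `indicator=True` flag of pandas. The two small DataFrames below are made-up stand-ins for `rawOcr.csv` and `label.csv`, not the real data.

```python
import pandas as pd

# Hypothetical miniature versions of rawOcr.csv and label.csv.
ocr_demo = pd.DataFrame({"id": ["a-1", "a-2", "b-1", "c-1"],
                         "fulltext": ["t1", "t2", "t3", "t4"]})
labels_demo = pd.DataFrame({"id": ["a-1", "b-1", "z-9"],
                            "tingkat-prestasi": ["Nasional", "Provinsi", "Nasional"]})

# An outer merge keeps every ID; the _merge column records where each ID came from.
merged = ocr_demo.merge(labels_demo, on="id", how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]                      # text and label available
unlabeled = merged.loc[merged["_merge"] == "left_only", "id"]     # OCR text without a label
missing_ocr = merged.loc[merged["_merge"] == "right_only", "id"]  # label without OCR text

print(sorted(matched["id"]))
print(sorted(unlabeled))
print(sorted(missing_ocr))
```

Rows tagged `right_only` play the role of the "ID yang hilang" list in the output above.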


Apply the same filter in the opposite direction so that both CSV files end up with the same number of rows.

In [18]:
FILE_ALL = "label.csv"
FILE_ID_FILTER = "rawOcr_filtered.csv"
ids = filter_data(FILE_ALL, FILE_ID_FILTER)
if ids is not None:
    verify_filter(f"{Path(FILE_ALL).stem}_filtered.csv", FILE_ID_FILTER)
    print("\n=== SELESAI ===")
else:
    print("Proses filtering gagal. Lihat pesan error di atas.")

Membaca 'rawOcr_filtered.csv' ...
Ditemukan 626 ID unik di 'rawOcr_filtered.csv'
Total baris di 'label.csv': 645
Baris setelah filtering: 626
File hasil disimpan sebagai: 'label_filtered.csv'

=== STATISTIK FILTERING ===
Total baris di label.csv: 645
Total ID unik di rawOcr_filtered.csv: 626
Baris yang cocok & dipertahankan: 626
Persentase data yang dipertahankan: 97.05%

=== VERIFIKASI HASIL ===
ID di rawOcr_filtered.csv: 626
ID di label_filtered.csv: 626
ID yang hilang: 0
ID tambahan (tak ada di filter): 0
✅ Filtering sempurna—semua ID cocok.

=== SELESAI ===


In [19]:
# Uncomment the following line to install the required libraries
# %pip install matplotlib scikit-learn

In [20]:
import pandas as pd
import numpy as np
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

Merge the labels with the OCR output.

In [21]:
# Load the filtered labels and the filtered OCR text
df_label = pd.read_csv('label_filtered.csv', usecols=['id', 'tingkat-prestasi'])
df_ocr = pd.read_csv('rawOcr_filtered.csv')

# Merge both tables on the shared `id` column
df = pd.merge(df_label, df_ocr, on='id')
df.head(5)

              id tingkat-prestasi                                           fulltext
0  424698876-1-1    Internasional  0 Kabax Dacademy Oones {ANGaROO CERTIFICATE OF...
1  424690287-1-1    Internasional  SCOUTS Creating a Better World sjolo 20 23 Jam...
2  424641786-3-1    Internasional   5 2= 7 << {Panicipan Ceruflcaite No : 0910791...
3  424640286-1-1    Internasional  UNIDA GONTOR CERTIFICATE OF ACHIEVEMENT 2616/U...
4  424617408-3-1    Internasional  Student Executive Board Faculty of Dental Medi...
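
`pd.merge` performs an inner join by default, and a duplicated `id` on either side would silently multiply rows. A minimal sketch of guarding against that with the `validate` argument follows; the two tiny DataFrames and their variable names are hypothetical and only illustrate the check.

```python
import pandas as pd

# Hypothetical stand-ins for the filtered label and OCR tables.
labels_demo = pd.DataFrame({"id": ["a-1", "b-1"],
                            "tingkat-prestasi": ["Nasional", "Provinsi"]})
ocr_demo = pd.DataFrame({"id": ["a-1", "b-1"],
                         "fulltext": ["teks a", "teks b"]})

# validate="one_to_one" raises MergeError if an id repeats on either side.
df_check = pd.merge(labels_demo, ocr_demo, on="id", how="inner", validate="one_to_one")
print(len(df_check))

# Simulate a duplicated OCR row to show the guard firing.
ocr_dup = pd.concat([ocr_demo, ocr_demo.iloc[[0]]], ignore_index=True)
try:
    pd.merge(labels_demo, ocr_dup, on="id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```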


In [22]:
print(f"\nDistribusi tingkat prestasi:")
print(df['tingkat-prestasi'].value_counts())


Distribusi tingkat prestasi:
tingkat-prestasi
Nasional             205
Kabupaten/Kota       109
Provinsi             108
Internasional        102
Tidak Terdefinisi    102
Name: count, dtype: int64
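
With 205 of 626 rows, `Nasional` is roughly twice as frequent as any other class, so accuracy figures are easier to read next to a majority-class baseline. The sketch below rebuilds a label list from the counts printed above (no real texts are involved) and shows the sizes an 80/20 stratified split produces with `test_size=0.2` and `random_state=42`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Class counts copied from the value_counts() output above.
counts = {"Nasional": 205, "Kabupaten/Kota": 109, "Provinsi": 108,
          "Internasional": 102, "Tidak Terdefinisi": 102}
labels = [name for name, n in counts.items() for _ in range(n)]

# Accuracy obtained by always predicting the most frequent class.
baseline = max(counts.values()) / len(labels)
print(f"Majority-class baseline: {baseline:.4f}")

# Stratified 80/20 split: each class keeps roughly its share in both parts.
train, test = train_test_split(labels, test_size=0.2, random_state=42, stratify=labels)
print(len(train), len(test))
print(Counter(test))
```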


## Text Preprocessing

In [23]:
%pip install nltk Sastrawi


In [24]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words_eng = set(stopwords.words('english'))
stop_words_indo = set(stopwords.words('indonesian'))
stop_words = stop_words_eng.union(stop_words_indo)


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [25]:
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

factory = StopWordRemoverFactory()
stop_words_sastrawi = set(factory.get_stop_words())
all_stop_words = stop_words_eng.union(stop_words_indo).union(stop_words_sastrawi)

In [26]:
print(all_stop_words)

{'berujar', 'sekecil', 'am', 'mengerjakan', 'agar', 'them', 'itu', 'lanjutnya', 'jadinya', 'anda', 'dimana', 'tunjuk', 'ungkap', 'usai', 'tertentu', 'ialah', 'sebutlah', 'having', 'atau', 'terjadilah', 'jadi', 'bersama', 'dimulainya', 'kapankah', 'ada', 'makin', 'seberapa', 'menanti-nanti', 'banyak', 'kamu', 'from', 'cukupkah', 'if', 'ditunjuki', 'disebut', 'most', 'semula', 'mana', 'sendirinya', 'usah', 'satu', 'masalah', 'pasti', 'karena', 'semisalnya', 'above', 'beginilah', 'diperkirakan', 'don', 'wahai', 'siapa', 'diakhiri', 'memang', 'sebaik-baiknya', 'ujarnya', 'menuju', 'our', 'menanyakan', 'ditunjukkannya', 'nyaris', 'ditandaskan', 'dini', 'rasa', 'hendak', 'sekarang', 'seingat', 'ditanya', 'apakah', "it's", 'couldn', 'berikutnya', 'dekat', 'diketahui', 'antaranya', 'benar', 'cukup', 'sepantasnya', 'there', 'setengah', 'ditegaskan', 'tapi', 'demikianlah', 'siapakah', 'menunjuknya', 'sebagai', 'during', 'toh', 'betulkah', 'didn', 'merasa', 'you', 'ia', 'kamulah', 'mereka', 'meny

In [27]:
def clean_ocr_text(text):
    """Remove common OCR noise from the text."""
    # Remove mis-encoded (mojibake) characters
    text = re.sub(r'Ã[ƒÂ§±„]', '', text)

    # Replace punctuation and other symbols left by OCR with spaces
    text = re.sub(r'[^\w\s]', ' ', text)

    # Remove standalone numbers
    text = re.sub(r'\b\d+\b', '', text)

    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text
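
To make the effect of each regex concrete, the sketch below repeats the function so that it runs on its own and applies it to two fragments taken from the certificate texts in this notebook. Standalone numbers such as `2616` disappear, while digits glued to letters (`...INDONESIA2022`) survive because `\b\d+\b` only matches whole tokens.

```python
import re

def clean_ocr_text(text):
    """Copy of the cleaning function above, repeated so this example is self-contained."""
    text = re.sub(r'Ã[ƒÂ§±„]', '', text)      # mis-encoded characters
    text = re.sub(r'[^\w\s]', ' ', text)       # punctuation and symbols -> space
    text = re.sub(r'\b\d+\b', '', text)        # standalone numbers
    text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
    return text

sample_a = "2616/UNIDA Gontor/FTID-h/II/1445"
sample_b = "B-02-THEREDIIKLCINDONESIA2022"

cleaned_a = clean_ocr_text(sample_a)
cleaned_b = clean_ocr_text(sample_b)
print(cleaned_a)
print(cleaned_b)
```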

In [28]:
def casefolding(text):
    """Convert to lowercase"""
    return text.lower()

In [29]:
def remove_punctuation(text):
    """Remove punctuation"""
    return text.translate(str.maketrans('', '', string.punctuation))

In [30]:
def remove_stopwords(text):
    """Remove stopwords"""
    words = text.split()
    return ' '.join([word for word in words if word not in all_stop_words])

In [31]:
def preprocess_text(text, cleaning=True, casefolding_flag=True,
                    punctuation_removal=True, stopword_removal=False):
    """Preprocess text with individually switchable steps."""
    # Treat missing values (NaN) as empty text
    if pd.isna(text):
        return ""

    text = str(text)  # Make sure the input is a string

    if cleaning:
        text = clean_ocr_text(text)

    if casefolding_flag:
        text = casefolding(text)

    if punctuation_removal:
        text = remove_punctuation(text)

    if stopword_removal:
        text = remove_stopwords(text)

    return text
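
The stop words printed earlier are all lowercase and `remove_stopwords` does an exact, case-sensitive lookup, so stop-word removal only works as intended when casefolding runs first. Experiment 3 enables both flags, so it is not affected. The sketch below illustrates the dependency with a deliberately tiny, hypothetical stop list instead of `all_stop_words`.

```python
# Hypothetical four-word stop list; the notebook uses the much larger all_stop_words.
STOP_WORDS = {"of", "the", "this", "is"}

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that is in stop_words (case-sensitive)."""
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "THIS CERTIFICATE IS PROVIDED TO the winner"

without_folding = remove_stopwords(sample)        # uppercase stop words slip through
with_folding = remove_stopwords(sample.lower())   # lowercase first, then filter

print(without_folding)
print(with_folding)
```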

In [32]:
def evaluate_knn_model(X, y, model_name, k=3):
    """Evaluate a KNN model with an 80/20 hold-out split and 3-fold cross-validation."""
    print(f"\n{'='*50}")
    print(f"EVALUASI MODEL: {model_name}")
    print(f"{'='*50}")
    
    # Split data (80-20)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    
    # KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=k, metric='cosine')
    knn.fit(X_train, y_train)
    
    # Predict on the hold-out set
    y_pred = knn.predict(X_test)

    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Akurasi: {accuracy:.4f}")
    
    # Cross-validation
    cv_scores = cross_val_score(knn, X, y, cv=3, scoring='accuracy')
    print(f"Cross-validation scores: {cv_scores}")
    print(f"Mean CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
    
    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    print("\nConfusion Matrix:")
    print(cm)
    
    return {
        'model_name': model_name,
        'accuracy': accuracy,
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'classification_report': classification_report(y_test, y_pred, output_dict=True)
    }
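
`evaluate_knn_model` receives a matrix `X` that was vectorized on the full dataset, so the TF-IDF vocabulary and IDF weights have already seen the test rows. One way to avoid this, sketched here under the assumption that raw texts are passed instead of a precomputed matrix, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so the vectorizer is refit inside every fold. The toy corpus and labels below are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: six "international" and six "national" certificates.
texts = [
    "international science olympiad certificate",
    "international karate championship certificate",
    "certificate of achievement international competition",
    "international math olympiad gold medal",
    "international english competition winner",
    "international robotics championship award",
    "olimpiade sains nasional sertifikat",
    "kejuaraan nasional karate piagam",
    "piagam penghargaan lomba nasional",
    "olimpiade matematika nasional medali emas",
    "lomba debat nasional juara",
    "kejuaraan robotik nasional penghargaan",
]
labels = ["Internasional"] * 6 + ["Nasional"] * 6

# The pipeline fits TF-IDF on the training part of each fold only.
model = make_pipeline(
    TfidfVectorizer(max_features=1000, ngram_range=(1, 2)),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
print(scores, scores.mean())
```

Scores from such a pipeline are not directly comparable with the figures in this notebook, which were computed on pre-vectorized matrices.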

In [33]:
# Encode the target labels as integers
le = LabelEncoder()
y_encoded = le.fit_transform(df['tingkat-prestasi'])

print(f"\nLabel mapping:")
for i, label in enumerate(le.classes_):
    print(f"{i}: {label}")


Label mapping:
0: Internasional
1: Kabupaten/Kota
2: Nasional
3: Provinsi
4: Tidak Terdefinisi


In [34]:
# Initialize the TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))

results = []

In [35]:
# Example: TF-IDF on a tiny corpus
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "saya suka makan nasi",
    "nasi goreng enak sekali",
    "saya makan goreng"
]

vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray())

['enak' 'enak sekali' 'goreng' 'goreng enak' 'makan' 'makan goreng'
 'makan nasi' 'nasi' 'nasi goreng' 'saya' 'saya makan' 'saya suka'
 'sekali' 'suka' 'suka makan']
[[0.         0.         0.         0.         0.31757018 0.
  0.41756662 0.31757018 0.         0.31757018 0.         0.41756662
  0.         0.41756662 0.41756662]
 [0.40301621 0.40301621 0.30650422 0.40301621 0.         0.
  0.         0.30650422 0.40301621 0.         0.         0.
  0.40301621 0.         0.        ]
 [0.         0.         0.3935112  0.         0.3935112  0.51741994
  0.         0.         0.         0.3935112  0.51741994 0.
  0.         0.         0.        ]]
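
The numbers in the first row can be reproduced by hand. With its default settings, `TfidfVectorizer` uses the smoothed weight idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then scales each document vector to unit L2 length. The sketch below applies this to the first sentence, "saya suka makan nasi", with document frequencies counted manually from the three-sentence corpus.

```python
import math

n_docs = 3  # number of sentences in the example corpus

def idf(df):
    """Smoothed IDF, as used by TfidfVectorizer with smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# Unigrams and bigrams of "saya suka makan nasi" with the number of
# sentences that contain each of them. Every term occurs once (tf = 1).
doc_freq = {"makan": 2, "nasi": 2, "saya": 2,
            "makan nasi": 1, "saya suka": 1, "suka": 1, "suka makan": 1}

raw = {term: 1 * idf(df) for term, df in doc_freq.items()}
norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm of the document vector
weights = {term: w / norm for term, w in raw.items()}

print(round(weights["makan"], 8))  # term shared with another sentence
print(round(weights["suka"], 8))   # term unique to this sentence
```

The two printed values correspond to the 0.31757018 and 0.41756662 entries in the first row of the matrix: terms that occur in fewer sentences receive the higher weight.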


In [36]:
# Experiment 1: raw OCR text without cleaning
print("\n" + "="*80)
print("EKSPERIMEN 1: OCR Tanpa Cleaning")
print("="*80)

df['text_exp1'] = df['fulltext']  # No preprocessing at all

print("Contoh teks tanpa preprocessing:")
for i in range(len(df)):
    print(f"Original: {df['text_exp1'].iloc[i][:100]}...")
    print()

X_exp1 = tfidf.fit_transform(df['text_exp1'])
print(tfidf.get_feature_names_out())
print(X_exp1.toarray())
result1 = evaluate_knn_model(X_exp1, y_encoded, "Exp 1: OCR Tanpa Cleaning")
results.append(result1)


EKSPERIMEN 1: OCR Tanpa Cleaning
Contoh teks tanpa preprocessing:
Original: 0 Kabax Dacademy Oones {ANGaROO CERTIFICATE OF ACHIEVEMENT B-02-THEREDIIKLCINDONESIA2022 INTERNATION...

Original: SCOUTS Creating a Better World sjolo 20 23 Jamboree JoN THE AIR @N THE [HTERNET CERTIFICATE OF PARTI...

Original: 5 2= 7 << {Panicipan  Ceruflcaite No : 0910791 /s/YPI/SEAOSM/vı/2023 THIS CERTIFICATE IS PROVIDED TO...

Original: UNIDA GONTOR CERTIFICATE OF ACHIEVEMENT 2616/UNIDA Gontor/FTID-h/II/1445 This certificate is proudly...

Original: Student Executive Board Faculty of Dental Medicine Airlangga University THIS CERTIFICATE NO : A4/DEN...

Original: STLG International cPaopionship Cethcale PEN 2022 0/ Cewpelftion Onlino Competllon 19-20 March HAGAN...

Original: FORKI NTERNATIONAL KARATE CHAMPIONSHIP "YOGiHKHRTH OPEN TOURNAKINT 12028" tournamenti PIAGAM PRESTAS...

Original: @U ISC international Science Olynpiad Certäficate GXPo EDUCATION EXPO NO. 2000201/EDUEXPO/ILTI/ISO-8...

Original: PER

In [37]:
# Experiment 2: OCR cleaning + casefolding + punctuation removal
print("\n" + "="*80)
print("EKSPERIMEN 2: OCR Cleaning + Casefolding + Punctuation Removal")
print("="*80)

df['text_exp2'] = df['fulltext'].apply(
    lambda x: preprocess_text(x, cleaning=True, casefolding_flag=True, 
                             punctuation_removal=True, stopword_removal=False)
)

print("Contoh preprocessing Eksperimen 2:")
for i in range(len(df)):
    print(f"Original: {df['fulltext'].iloc[i][:100]}...")
    print(f"Processed: {df['text_exp2'].iloc[i][:100]}...")
    print()

X_exp2 = tfidf.fit_transform(df['text_exp2'])
result2 = evaluate_knn_model(X_exp2, y_encoded, "Exp 2: OCR Cleaning + Casefolding + Punctuation Removal")
results.append(result2)


EKSPERIMEN 2: OCR Cleaning + Casefolding + Punctuation Removal
Contoh preprocessing Eksperimen 2:
Original: 0 Kabax Dacademy Oones {ANGaROO CERTIFICATE OF ACHIEVEMENT B-02-THEREDIIKLCINDONESIA2022 INTERNATION...
Processed: kabax dacademy oones angaroo certificate of achievement b therediiklcindonesia2022 international kan...

Original: SCOUTS Creating a Better World sjolo 20 23 Jamboree JoN THE AIR @N THE [HTERNET CERTIFICATE OF PARTI...
Processed: scouts creating a better world sjolo jamboree jon the air n the hternet certificate of participation...

Original: 5 2= 7 << {Panicipan  Ceruflcaite No : 0910791 /s/YPI/SEAOSM/vı/2023 THIS CERTIFICATE IS PROVIDED TO...
Processed: panicipan ceruflcaite no s ypi seaosm vı this certificate is provided to alifah muzayyanah ansar sma...

Original: UNIDA GONTOR CERTIFICATE OF ACHIEVEMENT 2616/UNIDA Gontor/FTID-h/II/1445 This certificate is proudly...
Processed: unida gontor certificate of achievement unida gontor ftid h ii this certificate is pro

In [38]:
# Experiment 3: OCR cleaning + casefolding + punctuation removal + stopword removal
print("\n" + "="*80)
print("EKSPERIMEN 3: OCR Cleaning + Casefolding + Punctuation Removal + Stopword Removal")
print("="*80)

df['text_exp3'] = df['fulltext'].apply(
    lambda x: preprocess_text(x, cleaning=True, casefolding_flag=True, 
                             punctuation_removal=True, stopword_removal=True)
)

print("Contoh preprocessing Eksperimen 3:")
for i in range(len(df)):
    print(f"Original: {df['fulltext'].iloc[i][:100]}...")
    print(f"Processed: {df['text_exp3'].iloc[i][:100]}...")
    print()

X_exp3 = tfidf.fit_transform(df['text_exp3'])
result3 = evaluate_knn_model(X_exp3, y_encoded, "Exp 3: OCR Cleaning + Casefolding + Punctuation Removal + Stopword Removal")
results.append(result3)



EKSPERIMEN 3: OCR Cleaning + Casefolding + Punctuation Removal + Stopword Removal
Contoh preprocessing Eksperimen 3:
Original: 0 Kabax Dacademy Oones {ANGaROO CERTIFICATE OF ACHIEVEMENT B-02-THEREDIIKLCINDONESIA2022 INTERNATION...
Processed: kabax dacademy oones angaroo certificate achievement b therediiklcindonesia2022 international kangar...

Original: SCOUTS Creating a Better World sjolo 20 23 Jamboree JoN THE AIR @N THE [HTERNET CERTIFICATE OF PARTI...
Processed: scouts creating better world sjolo jamboree jon air n hternet certificate participation certificate ...

Original: 5 2= 7 << {Panicipan  Ceruflcaite No : 0910791 /s/YPI/SEAOSM/vı/2023 THIS CERTIFICATE IS PROVIDED TO...
Processed: panicipan ceruflcaite ypi seaosm vı certificate provided alifah muzayyanah ansar sman bulukumba prov...

Original: UNIDA GONTOR CERTIFICATE OF ACHIEVEMENT 2616/UNIDA Gontor/FTID-h/II/1445 This certificate is proudly...
Processed: unida gontor certificate achievement unida gontor ftid h ii certifi

In [39]:
# Summary of results
print("\n" + "="*80)
print("RINGKASAN HASIL EKSPERIMEN")
print("="*80)

results_df = pd.DataFrame([
    {
        'Eksperimen': r['model_name'],
        'Akurasi': f"{r['accuracy']:.4f}",
        'CV Mean': f"{r['cv_mean']:.4f}",
        'CV Std': f"{r['cv_std']:.4f}"
    } for r in results
])

print(results_df.to_string(index=False))



RINGKASAN HASIL EKSPERIMEN
                                                                Eksperimen Akurasi CV Mean CV Std
                                                 Exp 1: OCR Tanpa Cleaning  0.7143  0.6564 0.0563
                   Exp 2: OCR Cleaning + Casefolding + Punctuation Removal  0.7063  0.6837 0.0280
Exp 3: OCR Cleaning + Casefolding + Punctuation Removal + Stopword Removal  0.7222  0.6853 0.0309
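
When reading the `Akurasi` column, it helps to translate the percentages back into counts. Assuming the merged DataFrame has 626 rows (the sum of the class counts shown earlier), a 20% hold-out set contains 126 certificates, and the short calculation below recovers how many of them each experiment classified correctly.

```python
import math

n_samples = 626                      # assumed number of rows after merging
n_test = math.ceil(0.2 * n_samples)  # train_test_split rounds the test size up

# Hold-out accuracies copied from the summary table above.
accuracies = {"Exp 1": 0.7143, "Exp 2": 0.7063, "Exp 3": 0.7222}
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}

print(n_test)
print(correct)
```

The three hold-out results differ by a single test certificate each, and the CV means of Exp 2 and Exp 3 lie well within one standard deviation of each other, so the ranking between the preprocessing variants should be read with caution.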


In [42]:
# Analyze the TF-IDF features
print("\n" + "="*80)
print("ANALISIS TF-IDF FEATURES")
print("="*80)

# Show the top features for each experiment
for text_col, exp_name in [('text_exp1', "Exp 1"), ('text_exp2', "Exp 2"), ('text_exp3', "Exp 3")]:
    print(f"\nTop 10 TF-IDF features untuk {exp_name}:")

    # Refit TF-IDF on this experiment's text to recover its feature names
    # (the shared `tfidf` object only remembers its most recent fit)
    tfidf_temp = TfidfVectorizer(max_features=1000, ngram_range=(1, 2))
    X_temp = tfidf_temp.fit_transform(df[text_col])

    # Mean TF-IDF score of each feature across all documents
    feature_names = tfidf_temp.get_feature_names_out()
    mean_scores = np.array(X_temp.mean(axis=0)).flatten()

    # Indices of the 10 highest mean scores, in descending order
    top_indices = mean_scores.argsort()[-10:][::-1]

    for idx in top_indices:
        print(f"  {feature_names[idx]}: {mean_scores[idx]:.4f}")



ANALISIS TF-IDF FEATURES

Top 10 TF-IDF features untuk Exp 1:
  2023: 0.0542
  2022: 0.0498
  2021: 0.0453
  of: 0.0391
  dan: 0.0365
  sma: 0.0337
  nasional: 0.0334
  indonesia: 0.0328
  olimpiade: 0.0321
  kepada: 0.0315

Top 10 TF-IDF features untuk Exp 2:
  of: 0.0388
  dan: 0.0386
  sma: 0.0354
  nasional: 0.0347
  indonesia: 0.0343
  olimpiade: 0.0336
  kepada: 0.0332
  pada: 0.0327
  sebagai: 0.0322
  diberikan: 0.0318

Top 10 TF-IDF features untuk Exp 3:
  sma: 0.0367
  indonesia: 0.0362
  nasional: 0.0361
  olimpiade: 0.0354
  makassar: 0.0331
  ketua: 0.0325
  sulawesi: 0.0314
  sains: 0.0305
  sertifikat: 0.0297
  tingkat: 0.0294
