# Task
Predict the 'Kategori' column in this dataset using Naive Bayes, KNN, and SVM.

Here is all the data you need:
"Dataset_Klasifikasi_Email.csv"

## Data loading

### Subtask:
Load the dataset.


**Reasoning**:
Load the dataset and display its first few rows and shape to verify the data loading.



In [7]:
import pandas as pd

try:
    df = pd.read_csv('Dataset_Klasifikasi_Email.csv')
    display(df.head())
    print(df.shape)
except FileNotFoundError:
    print("Error: 'Dataset_Klasifikasi_Email.csv' not found.")
    df = None # Indicate that the DataFrame was not successfully created
except pd.errors.ParserError:
    print("Error: Could not parse the CSV file. Please check the file format.")
    df = None
except Exception as e:
    print(f"An unexpected error occurred: {e}")
    df = None

Unnamed: 0,Email,Kategori
0,Mohon kirimkan laporan pengeluaran minggu ini,Keuangan
1,Transfer dana untuk vendor sudah dilakukan,Keuangan
2,Tagihan listrik bulan April telah diterima,Keuangan
3,Invoice untuk pembelian perangkat lunak diterima,Keuangan
4,Laporan keuangan akhir tahun harus selesai min...,Keuangan


(30, 2)


## Data exploration

### Subtask:
Explore the loaded dataset to understand its structure, identify features and target variable, check for missing values, and analyze data distribution.


**Reasoning**:
Inspect the DataFrame's structure and summary statistics, confirm that no values are missing, check the class balance of 'Kategori', and derive a character-length feature from the email text.



In [8]:
# Display basic info
display(df.head())
print(df.info())
print(df.describe())

# Check for missing values
print("\nMissing Values:")
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)

# Analyze the distribution of the target variable
print("\nTarget Variable Distribution:")
print(df['Kategori'].value_counts())
print(df['Kategori'].nunique())

# Analyze text data
print("\nText Data Analysis:")
df['Email_Length'] = df['Email'].str.len()
print(df['Email_Length'].describe())

Unnamed: 0,Email,Kategori
0,Mohon kirimkan laporan pengeluaran minggu ini,Keuangan
1,Transfer dana untuk vendor sudah dilakukan,Keuangan
2,Tagihan listrik bulan April telah diterima,Keuangan
3,Invoice untuk pembelian perangkat lunak diterima,Keuangan
4,Laporan keuangan akhir tahun harus selesai min...,Keuangan


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Email     30 non-null     object
 1   Kategori  30 non-null     object
dtypes: object(2)
memory usage: 612.0+ bytes
None
                                                Email  Kategori
count                                              30        30
unique                                             30         3
top     Mohon kirimkan laporan pengeluaran minggu ini  Keuangan
freq                                                1        10

Missing Values:
Email       0
Kategori    0
dtype: int64
Email       0.0
Kategori    0.0
dtype: float64

Target Variable Distribution:
Kategori
Keuangan    10
Rapat       10
Proyek      10
Name: count, dtype: int64
3

Text Data Analysis:
count    30.000000
mean     42.600000
std       5.963105
min      30.000000
25%      40.000000
50%      42.000000
75%      45.750000
max 
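
Character length is only one view of the text. As a hedged sketch, the snippet below also counts words per email and summarizes them per category. The `sample` frame is hypothetical stand-in data (the first two rows echo emails shown above, the rest are invented for illustration), and `Word_Count` is a name introduced here, not a column of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV; only the column names match the notebook.
sample = pd.DataFrame({
    "Email": [
        "Mohon kirimkan laporan pengeluaran minggu ini",
        "Transfer dana untuk vendor sudah dilakukan",
        "Rapat tim dijadwalkan besok pagi",          # invented example
        "Undangan rapat evaluasi bulanan",           # invented example
        "Progres proyek aplikasi sudah 80 persen",   # invented example
        "Deadline proyek website diundur",           # invented example
    ],
    "Kategori": ["Keuangan", "Keuangan", "Rapat", "Rapat", "Proyek", "Proyek"],
})

# Word count complements the character-length feature used in the notebook.
sample["Word_Count"] = sample["Email"].str.split().str.len()

# A per-category summary hints at whether length-based features separate the classes.
length_by_category = sample.groupby("Kategori")["Word_Count"].agg(["mean", "min", "max"])
print(length_by_category)
```

If the per-category ranges overlap heavily, length alone is unlikely to distinguish the classes.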

## Data preparation

### Subtask:
Prepare the data for model training by handling missing values (if any) and converting categorical features into numerical representations.


**Reasoning**:
One-hot encode the categorical target 'Kategori' and combine it with the numeric feature derived from the email text (`Email_Length`) in a single DataFrame. Because the encoded columns are a copy of the target, using them as model inputs leaks the label; `Email_Length` is the only genuine feature here.



In [9]:
# One-hot encode the 'Kategori' column
encoded_kategori = pd.get_dummies(df['Kategori'], prefix='Kategori')

# Build a new DataFrame from the numerical feature and the encoded categories
# (the original 'Email' and 'Kategori' columns are simply left out, not dropped from df)
df_encoded = pd.concat([df['Email_Length'], encoded_kategori], axis=1)

# Display the resulting DataFrame for verification
display(df_encoded.head())

Unnamed: 0,Email_Length,Kategori_Keuangan,Kategori_Proyek,Kategori_Rapat
0,45,True,False,False
1,42,True,False,False
2,42,True,False,False
3,48,True,False,False
4,55,True,False,False
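
The one-hot columns above encode the target, so they are better kept out of the feature matrix. A minimal sketch of a leak-free alternative, assuming scikit-learn's `LabelEncoder` for the target and a small hypothetical `demo` frame in place of the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in rows; the real notebook would use df instead.
demo = pd.DataFrame({
    "Email_Length": [45, 42, 38, 51, 40, 47],
    "Kategori": ["Keuangan", "Keuangan", "Rapat", "Rapat", "Proyek", "Proyek"],
})

# Features: only columns that are not derived from the target.
X_demo = demo[["Email_Length"]]

# Target: integer codes, kept separate from X so no label information leaks in.
encoder = LabelEncoder()
y_demo = encoder.fit_transform(demo["Kategori"])

print(list(encoder.classes_))
print(y_demo.tolist())
```

scikit-learn classifiers also accept string labels directly, so encoding the target is optional; the point of the sketch is the separation between `X_demo` and `y_demo`.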


## Data splitting

### Subtask:
Split the data into training and testing sets.


**Reasoning**:
Split the data into training and testing sets using `train_test_split`.



In [10]:
from sklearn.model_selection import train_test_split

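# Define features (X) and target variable (y)
# NOTE: the Kategori_* columns are a one-hot copy of the target, so keeping them in X
# leaks the label into the features and can inflate the scores reported below.
X = df_encoded[['Email_Length', 'Kategori_Keuangan', 'Kategori_Proyek', 'Kategori_Rapat']]
y = df['Kategori']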

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
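
With only 30 rows, a plain random split can leave a class under-represented in the test set. The sketch below shows a stratified split, assuming a hypothetical balanced `demo` frame with the same class counts as the dataset; the `Email_Length` values are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced stand-in: 30 rows, 10 per class, like the dataset described above.
labels = ["Keuangan"] * 10 + ["Rapat"] * 10 + ["Proyek"] * 10
demo = pd.DataFrame({"Email_Length": range(30, 60), "Kategori": labels})

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["Email_Length"]],
    demo["Kategori"],
    test_size=0.2,
    random_state=42,
    stratify=demo["Kategori"],  # keep class proportions identical in both splits
)

print(y_te.value_counts().to_dict())
```

Passing `stratify` makes the balanced test set a guarantee rather than a lucky outcome of a particular `random_state`.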

## Model training

### Subtask:
Train Naive Bayes, KNN, and SVM models.


**Reasoning**:
Train Naive Bayes, KNN, and SVM models using the training data.



In [11]:
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Initialize the classifiers
nb_model = GaussianNB()
knn_model = KNeighborsClassifier()
svm_model = SVC()

# Train the models
nb_model.fit(X_train, y_train)
knn_model.fit(X_train, y_train)
svm_model.fit(X_train, y_train)
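
The models above never see the email text. As a sketch of how the same three model families could be trained on the text instead, the snippet below wraps each classifier in a pipeline with `TfidfVectorizer`. Two assumptions to flag: `MultinomialNB` is swapped in for `GaussianNB` because TF-IDF produces sparse, non-negative features that `GaussianNB` does not accept directly, and the training emails are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Invented training emails (not taken from the real dataset).
train_texts = [
    "Tagihan listrik bulan ini sudah dibayar",
    "Transfer dana ke vendor sudah dilakukan",
    "Rapat tim dijadwalkan besok pagi",
    "Undangan rapat evaluasi bulanan",
    "Progres proyek aplikasi perlu dilaporkan",
    "Deadline proyek website diundur",
]
train_labels = ["Keuangan", "Keuangan", "Rapat", "Rapat", "Proyek", "Proyek"]

# Each pipeline vectorizes the raw text, then fits the classifier on the TF-IDF matrix.
text_models = {
    "Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "KNN": make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3)),
    "SVM": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
}

new_email = ["Tagihan vendor belum dibayar"]  # invented query
for name, model in text_models.items():
    model.fit(train_texts, train_labels)
    print(name, model.predict(new_email)[0])
```

Keeping the vectorizer inside the pipeline means it is fitted only on the training texts, which avoids leaking vocabulary statistics from the test set.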

## Model evaluation

### Subtask:
Evaluate the performance of the trained Naive Bayes, KNN, and SVM models.


**Reasoning**:
Evaluate the performance of the trained models using various metrics and generate classification reports.



In [12]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report

# Predict on the test set for each model
nb_predictions = nb_model.predict(X_test)
knn_predictions = knn_model.predict(X_test)
svm_predictions = svm_model.predict(X_test)

# Evaluate each model
def evaluate_model(y_true, y_pred, model_name):
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')
    print(f"--- {model_name} ---")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-score: {f1:.4f}")
    print(classification_report(y_true, y_pred))

evaluate_model(y_test, nb_predictions, "Naive Bayes")
evaluate_model(y_test, knn_predictions, "KNN")
evaluate_model(y_test, svm_predictions, "SVM")

--- Naive Bayes ---
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-score: 1.0000
              precision    recall  f1-score   support

    Keuangan       1.00      1.00      1.00         2
      Proyek       1.00      1.00      1.00         2
       Rapat       1.00      1.00      1.00         2

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00         6
weighted avg       1.00      1.00      1.00         6

--- KNN ---
Accuracy: 0.6667
Precision: 0.7222
Recall: 0.6667
F1-score: 0.6556
              precision    recall  f1-score   support

    Keuangan       0.67      1.00      0.80         2
      Proyek       0.50      0.50      0.50         2
       Rapat       1.00      0.50      0.67         2

    accuracy                           0.67         6
   macro avg       0.72      0.67      0.66         6
weighted avg       0.72      0.67      0.66         6

--- SVM ---
Accuracy: 0.3333
Precision: 0.3333
Recall: 0.3333
F1-score: 
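
A perfect score on a feature as weak as email length deserves a sanity check. The sketch below uses synthetic data, not the real dataset, to compare Gaussian Naive Bayes on a pure-noise feature with and without a one-hot copy of the target in the feature matrix; `holdout_accuracy` is a helper name introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic data: the only genuine feature is random noise, so it carries no class signal.
labels = pd.Series(["Keuangan"] * 10 + ["Rapat"] * 10 + ["Proyek"] * 10, name="Kategori")
noise = pd.DataFrame({"Email_Length": rng.integers(30, 60, size=30)})

# Leaky design: a one-hot copy of the target sits next to the noise feature.
leaky = pd.concat([noise, pd.get_dummies(labels, prefix="Kategori").astype(int)], axis=1)

def holdout_accuracy(features):
    """Fit GaussianNB on a stratified 80/20 split and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=42, stratify=labels
    )
    return GaussianNB().fit(X_tr, y_tr).score(X_te, y_te)

leaky_acc = holdout_accuracy(leaky)
clean_acc = holdout_accuracy(noise)
print(f"with target columns: {leaky_acc:.2f}")
print(f"noise feature only:  {clean_acc:.2f}")
```

When the target columns are present the model can read the answer straight from its inputs, so a large gap between the two numbers points to leakage rather than to a strong model.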

## Model optimization

### Subtask:
Optimize the hyperparameters of the KNN and SVM models to improve their performance.


**Reasoning**:
Optimize the hyperparameters of the KNN and SVM models using cross-validation and then retrain them with the best hyperparameters. Evaluate the optimized models.



In [13]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for KNN
knn_param_grid = {
    'n_neighbors': range(1, 11),
    'weights': ['uniform', 'distance']
}
knn_grid_search = GridSearchCV(knn_model, knn_param_grid, cv=5)
knn_grid_search.fit(X_train, y_train)
print(f"Best KNN hyperparameters: {knn_grid_search.best_params_}")

# Hyperparameter tuning for SVM
svm_param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly']
}
svm_grid_search = GridSearchCV(svm_model, svm_param_grid, cv=5)
svm_grid_search.fit(X_train, y_train)
print(f"Best SVM hyperparameters: {svm_grid_search.best_params_}")

# Retrain models with best hyperparameters
best_knn_model = KNeighborsClassifier(**knn_grid_search.best_params_)
best_knn_model.fit(X_train, y_train)

best_svm_model = SVC(**svm_grid_search.best_params_)
best_svm_model.fit(X_train, y_train)

# Evaluate optimized models with evaluate_model, defined in the evaluation cell above

best_knn_pred = best_knn_model.predict(X_test)
best_svm_pred = best_svm_model.predict(X_test)
evaluate_model(y_test, best_knn_pred, "Optimized KNN")
evaluate_model(y_test, best_svm_pred, "Optimized SVM")

Best KNN hyperparameters: {'n_neighbors': 9, 'weights': 'uniform'}
Best SVM hyperparameters: {'C': 1, 'kernel': 'linear'}
--- Optimized KNN ---
Accuracy: 0.6667
Precision: 0.8333
Recall: 0.6667
F1-score: 0.6667
              precision    recall  f1-score   support

    Keuangan       1.00      0.50      0.67         2
      Proyek       0.50      1.00      0.67         2
       Rapat       1.00      0.50      0.67         2

    accuracy                           0.67         6
   macro avg       0.83      0.67      0.67         6
weighted avg       0.83      0.67      0.67         6

--- Optimized SVM ---
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1-score: 1.0000
              precision    recall  f1-score   support

    Keuangan       1.00      1.00      1.00         2
      Proyek       1.00      1.00      1.00         2
       Rapat       1.00      1.00      1.00         2

    accuracy                           1.00         6
   macro avg       1.00      1.00      1.00    
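
The search above retrains the winning configuration by hand. A hedged sketch of a more compact variant follows, assuming synthetic numeric data from `make_classification` in place of the real features: the scaler and the SVM sit in one `Pipeline`, the folds are stratified, and the refitted `best_estimator_` is used directly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for real features (an assumption for this sketch).
X_syn, y_syn = make_classification(
    n_samples=60, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Scaling inside the pipeline is re-fitted on each training fold, never on validation data.
pipeline = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf", "poly"]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv)
search.fit(X_syn, y_syn)

# refit=True (the default) means best_estimator_ is already trained on all of X_syn,
# so no separate retraining step is needed.
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy, which is usually a steadier basis for comparing configurations than accuracy on a six-sample test set.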

## Summary

### Q&A
No questions were posed in the provided task.

### Data Analysis Key Findings
* **Data Loading and Exploration:** The dataset, `Dataset_Klasifikasi_Email.csv`, contains 30 emails split evenly across three categories: 'Keuangan', 'Rapat', and 'Proyek' (10 each). No missing values were found. A new feature, `Email_Length` (character count), was engineered.
* **Data Preparation:** One-hot encoding was applied to the 'Kategori' column, which is the target itself. Critically, no text preprocessing was performed on the email text.
* **Model Training:** Three models (Naive Bayes, KNN, and SVM) were trained with default parameters. The feature matrix combined `Email_Length` with the one-hot encoded categories, so the target leaked into the inputs.
* **Initial Model Evaluation:** On the 6-email test set, Naive Bayes achieved perfect accuracy (1.0000), KNN reached 0.6667, and SVM 0.3333. Because of the leakage, these figures do not measure how well the email text predicts the category.
* **Model Optimization:** After a grid search with 5-fold cross-validation, the SVM reached perfect test accuracy (1.0000), while KNN's accuracy stayed at 0.6667 (its weighted precision rose from 0.7222 to 0.8333). The best hyperparameters for KNN were `n_neighbors = 9` and `weights = 'uniform'`; for SVM they were `C = 1` and `kernel = 'linear'`.


### Insights or Next Steps
* **Text Feature Engineering:** The most impactful next step is to build features from the email content itself. Techniques such as TF-IDF or word embeddings would give all three models real signal to learn from, which `Email_Length` alone is unlikely to provide.
* **Investigate the Perfect Scores:** The perfect scores of Naive Bayes and the tuned SVM are most plausibly explained by the one-hot encoded 'Kategori' columns in the feature matrix rather than by genuine predictive power. Removing those columns and re-evaluating with cross-validation, ideally on a larger dataset, would give a more realistic estimate of generalization performance.
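
Both next steps can be combined in a few lines. The sketch below is one possible setup, not a tuned solution: it cross-validates a TF-IDF plus `MultinomialNB` pipeline with stratified folds. The nine emails are invented for illustration, and `MultinomialNB` is an assumed substitute for the notebook's `GaussianNB`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented emails, three per category (not taken from the real dataset).
texts = [
    "Tagihan listrik bulan ini sudah dibayar",
    "Transfer dana ke vendor sudah dilakukan",
    "Laporan keuangan bulan ini sudah dikirim",
    "Rapat tim dijadwalkan besok pagi",
    "Undangan rapat evaluasi bulanan",
    "Agenda rapat direksi sudah dibagikan",
    "Progres proyek aplikasi perlu dilaporkan",
    "Deadline proyek website diundur",
    "Tim proyek membutuhkan tambahan anggota",
]
labels = ["Keuangan"] * 3 + ["Rapat"] * 3 + ["Proyek"] * 3

# The vectorizer lives inside the pipeline, so each fold fits it on training texts only.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Stratified folds keep one email per category in every validation fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv)

print(f"fold accuracies: {scores}")
print(f"mean accuracy:   {scores.mean():.3f}")
```

On the real 30-email dataset the same pattern would apply with `df['Email']` and `df['Kategori']`; the spread across folds gives a rough sense of how unstable an estimate from so few samples is.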
