# 📧 Email Spam Detection to Prevent Fraud and Phishing
### *Machine Learning Final Project*

---

## 👥 Group Information

**Group 4**

| No | Full Name             | Student ID (NIM) |
| -- | --------------------- | ---------------- |
| 1  | Abdul Fatah           | 10222182         |
| 2  | Salwa Nurazizah       | 10222154         |
| 3  | Anisa                 | 10222134         |
| 4  | Tiara Kurniawati      | 10222155         |
| 5  | Wina Apriliani Rahayu | 10222111         |


---

## 🧩 Case Study

The growth of email as a communication technology makes exchanging information easier, but it also increases the risk of spam emails that carry fraud and phishing attempts. Such emails can compromise the security of personal data and cause losses for users.

For this reason, this final project for the Machine Learning course develops an email spam detection system based on machine learning algorithms with a Natural Language Processing approach. The system automatically classifies emails as spam or non-spam (ham) based on text patterns learned by the model. It is intended to help users filter out harmful emails and to improve the security and convenience of using email services.


## **Data Collection**

In [16]:
import pandas as pd

df = pd.read_csv("email_spam_indo.csv")
df.head()


|   | Kategori | Pesan |
| - | -------- | ----- |
| 0 | spam | Secara alami tak tertahankan identitas perusah... |
| 1 | spam | Fanny Gunslinger Perdagangan Saham adalah Merr... |
| 2 | spam | Rumah -rumah baru yang luar biasa menjadi muda... |
| 3 | spam | 4 Permintaan Khusus Pencetakan Warna Informasi... |
| 4 | spam | Jangan punya uang, dapatkan CD perangkat lunak... |


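In [17]:
# Encode the category as a binary target: spam -> 1, ham -> 0.
# Any other category value would map to NaN.
df['label'] = df['Kategori'].map({
    'spam': 1,
    'ham': 0
})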


In [18]:
df[['Kategori', 'label']].head()


|   | Kategori | label |
| - | -------- | ----- |
| 0 | spam | 1 |
| 1 | spam | 1 |
| 2 | spam | 1 |
| 3 | spam | 1 |
| 4 | spam | 1 |


In [19]:
df['label'].value_counts()


| label | count |
| ----- | ----- |
| 1 | 1368 |
| 0 | 1268 |
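
The counts above show a nearly balanced dataset. As a rough illustration, the sketch below computes the class proportions and the weights that scikit-learn's documented `class_weight='balanced'` formula, `n_samples / (n_classes * count)`, would give for these counts. It uses the full-dataset counts as a stand-in; the models in this notebook are fitted on the training split only, so the actual weights may differ slightly.

```python
# Class counts taken from the value_counts() output above.
counts = {1: 1368, 0: 1268}  # 1 = spam, 0 = ham
total = sum(counts.values())

# Share of each class in the dataset.
proportions = {label: n / total for label, n in counts.items()}

# Weights per the 'balanced' heuristic: n_samples / (n_classes * count).
# Assumption: applied to the full dataset for illustration only.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(label, counts[label], round(proportions[label], 3), round(weights[label], 4))
```

With weights this close to 1, the balancing has only a small effect on this dataset.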


## **Text Preprocessing**

In [20]:
!pip install Sastrawi




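In [21]:
import re
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

# Indonesian stop-word list from Sastrawi, stored as a set for fast lookups.
stopwords = set(StopWordRemoverFactory().get_stop_words())

def preprocess_text(text):
    """Lowercase, keep only a-z letters, and drop Indonesian stop words."""
    text = str(text).lower()
    # Replace digits, punctuation, and any other non a-z character with a space.
    text = re.sub(r'[^a-z\s]', ' ', text)
    tokens = text.split()
    tokens = [t for t in tokens if t not in stopwords]
    return ' '.join(tokens)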
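
To make the effect of each preprocessing step visible, here is a minimal standalone sketch of the same logic. It assumes a tiny hypothetical stop-word set (`demo_stopwords`) in place of the Sastrawi list so that it runs without extra dependencies, and the second sample sentence is invented for illustration.

```python
import re

# Hypothetical stand-in for the Sastrawi stop-word list (assumption).
demo_stopwords = {"secara", "adalah", "yang"}

def preprocess_demo(text):
    """Same steps as preprocess_text, but with the demo stop-word set."""
    text = str(text).lower()
    text = re.sub(r'[^a-z\s]', ' ', text)  # digits and punctuation become spaces
    tokens = [t for t in text.split() if t not in demo_stopwords]
    return ' '.join(tokens)

sample_1 = preprocess_demo("Rumah -rumah baru yang luar biasa")
sample_2 = preprocess_demo("Klik http://contoh.id sekarang, GRATIS 100%!")
print(sample_1)
print(sample_2)
```

Because the regular expression splits URLs at every non-letter character, fragments such as `http` survive as ordinary tokens, which is one way a term like that can end up among the learned features.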


In [22]:
df['clean_text'] = df['Pesan'].apply(preprocess_text)


In [23]:
df[['Pesan', 'clean_text']].head()


|   | Pesan | clean_text |
| - | ----- | ---------- |
| 0 | Secara alami tak tertahankan identitas perusah... | alami tak tertahankan identitas perusahaan san... |
| 1 | Fanny Gunslinger Perdagangan Saham adalah Merr... | fanny gunslinger perdagangan saham merrill muz... |
| 2 | Rumah -rumah baru yang luar biasa menjadi muda... | rumah rumah baru luar biasa menjadi mudah menu... |
| 3 | 4 Permintaan Khusus Pencetakan Warna Informasi... | permintaan khusus pencetakan warna informasi t... |
| 4 | Jangan punya uang, dapatkan CD perangkat lunak... | jangan punya uang dapatkan cd perangkat lunak ... |


## **Feature Engineering**

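In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF over unigrams and bigrams, capped at 5,000 features.
tfidf = TfidfVectorizer(
    ngram_range=(1,2),
    max_features=5000
)

# Note: the vectorizer is fitted on the full dataset, before the train/test
# split, so vocabulary and IDF statistics from the test rows leak into the
# features and the reported scores may be slightly optimistic.
X = tfidf.fit_transform(df['clean_text'])
y = df['label']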
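
The `ngram_range=(1,2)` setting means each document contributes both single words and pairs of adjacent words as candidate features. The sketch below illustrates the idea in plain Python. `word_ngrams` is a hypothetical helper, not part of scikit-learn, and it splits on whitespace only, whereas `TfidfVectorizer` applies its own tokenizer (which by default also drops one-character tokens).

```python
def word_ngrams(text, min_n=1, max_n=2):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

demo_grams = word_ngrams("klik sini sekarang")
print(demo_grams)
```

Bigrams let the model weigh a phrase such as `klik sini` separately from its individual words.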


## **Modeling**

In [25]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; stratify to preserve the spam/ham ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)


In [26]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(
    class_weight='balanced',
    max_iter=1000
)

model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)


In [28]:
from sklearn.svm import LinearSVC

model_svm = LinearSVC(
    class_weight='balanced'
)

model_svm.fit(X_train, y_train)
y_pred_svm = model_svm.predict(X_test)


## **Evaluation**

In [29]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_test, y_pred, name):
    """Print accuracy, precision, recall, and F1 for one model."""
    # Precision, recall, and F1 use scikit-learn's default positive class (label 1, spam).
    print(f"=== {name} ===")
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1-score :", f1_score(y_test, y_pred))
    print()

evaluate(y_test, y_pred_lr, "Logistic Regression")
evaluate(y_test, y_pred_svm, "SVM")


=== Logistic Regression ===
Accuracy : 0.9867424242424242
Precision: 0.978494623655914
Recall   : 0.9963503649635036
F1-score : 0.9873417721518988

=== SVM ===
Accuracy : 0.9886363636363636
Precision: 0.9855072463768116
Recall   : 0.9927007299270073
F1-score : 0.9890909090909091
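
To show how the four scores relate to each other, the sketch below recomputes them from confusion-matrix counts. The counts are not printed by the notebook. They are inferred here from the reported scores, assuming a 528-row test set (20% of 2,636 rows, rounded up), so treat them as a consistency check rather than as recorded output.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 with spam as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of emails flagged as spam, how many were spam
    recall = tp / (tp + fn)      # of actual spam emails, how many were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Inferred counts (assumption): chosen to reproduce the reported scores.
lr_counts = dict(tp=273, fp=6, fn=1, tn=248)
svm_counts = dict(tp=272, fp=4, fn=2, tn=250)

lr_scores = metrics_from_counts(**lr_counts)
svm_scores = metrics_from_counts(**svm_counts)
print("Logistic Regression:", lr_scores)
print("SVM                :", svm_scores)
```

Under these inferred counts, Logistic Regression misses one spam email and flags six ham emails, while the SVM misses two and flags four, so the two models differ by only a handful of test emails.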



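In [30]:
def predict_email(text):
    """Classify one raw email text as "SPAM" or "HAM" using both models."""
    # Apply the same cleaning and the already-fitted vectorizer as in training.
    clean = preprocess_text(text)
    vector = tfidf.transform([clean])

    lr = model_lr.predict(vector)[0]
    svm = model_svm.predict(vector)[0]

    # Flag the email as spam if either model predicts spam (logical OR).
    if lr == 1 or svm == 1:
        return "SPAM"
    else:
        return "HAM"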
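
The decision rule in `predict_email` is an OR over the two model votes. The standalone sketch below enumerates all four vote combinations to make the behaviour explicit. `combine_votes` is a hypothetical helper written for this illustration, not a function from the notebook.

```python
def combine_votes(votes):
    """Return "SPAM" if any model voted 1 (spam), otherwise "HAM"."""
    return "SPAM" if any(v == 1 for v in votes) else "HAM"

# Every combination of (logistic regression vote, SVM vote).
vote_table = {
    (lr_vote, svm_vote): combine_votes([lr_vote, svm_vote])
    for lr_vote in (0, 1)
    for svm_vote in (0, 1)
}
for pair, verdict in vote_table.items():
    print(pair, verdict)
```

Only the case where both models vote ham yields "HAM". This rule tends to catch more spam at the cost of more false alarms than requiring both models to agree, which may be a reasonable trade-off when missed phishing emails are considered more costly.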


## **Testing Model**

In [31]:
# Indonesian test message, roughly: "we offer a good position for you,
# click the link above".
predict_email(
    "kami menawarkan posisi yang baik untuk anda klik link diatas"
)


'SPAM'

In [32]:
# Inspect the logistic regression coefficients: large positive weights push
# towards spam (label 1), large negative weights push towards ham (label 0).
feature_names = tfidf.get_feature_names_out()
coef = model_lr.coef_[0]

top_spam = sorted(
    zip(coef, feature_names),
    reverse=True
)[:15]

top_ham = sorted(
    zip(coef, feature_names)
)[:15]

print("Dominant spam words:", top_spam)
print("Dominant ham words:", top_ham)

Dominant spam words: [(np.float64(1.9529195529772314), 'sini'), (np.float64(1.8630682129791563), 'gratis'), (np.float64(1.8236747368059774), 'http'), (np.float64(1.795032470163804), 'klik'), (np.float64(1.7869098325448731), 'uang'), (np.float64(1.668615046489666), 'situs'), (np.float64(1.581224115521442), 'lebih'), (np.float64(1.511344080939744), 'klik sini'), (np.float64(1.4859707390450398), 'adobe'), (np.float64(1.4822640286539315), 'obat'), (np.float64(1.4809086789106973), 'viagra'), (np.float64(1.471434179687373), 'akun'), (np.float64(1.4239890230556982), 'sekarang'), (np.float64(1.3734393808635104), 'pria'), (np.float64(1.339052798019964), 'lunak')]
Dominant ham words: [(np.float64(-4.8441402534326095), 'enron'), (np.float64(-4.39907486229114), 'vince'), (np.float64(-2.3152573072492326), 'ect'), (np.float64(-2.093913247415953), 'model'), (np.float64(-1.862698255893567), 'kaminski'), (np.float64(-1.8113418249765694), 'terima kasih'), (np.float64(-1.7658826969566457), 'penelitian'), (np

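In [33]:
import joblib

# Bundle both classifiers into one file.
models = {
    "logistic": model_lr,
    "svm": model_svm
}

joblib.dump(models, "models.pkl")
# Save the fitted vectorizer too; inference needs the same vocabulary and IDF weights.
joblib.dump(tfidf, "tfidf.pkl")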


['tfidf.pkl']

In [34]:
# Colab-only helper: triggers a browser download of the vectorizer file.
from google.colab import files

files.download("tfidf.pkl")



In [35]:
!ls


email_spam_indo.csv  models.pkl  sample_data  tfidf.pkl


In [36]:
!zip spam_model.zip models.pkl tfidf.pkl


  adding: models.pkl (deflated 11%)
  adding: tfidf.pkl (deflated 72%)
