<a href="https://colab.research.google.com/github/yumnaehab-tech/data-science-projects/blob/main/spam_dataset_yumna.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!unzip "/content/archive (1).zip" -d /content/


Archive:  /content/archive (1).zip
  inflating: /content/spam.csv       


In [None]:
import os
print(os.listdir("/content"))


['.config', 'archive (1).zip', 'spam.csv', 'sample_data']


In [None]:
import pandas as pd

data = pd.read_csv("/content/spam.csv", encoding='latin-1')
print(data.head())


     v1                                                 v2 Unnamed: 2  \
0   ham  Go until jurong point, crazy.. Available only ...        NaN   
1   ham                      Ok lar... Joking wif u oni...        NaN   
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...        NaN   
3   ham  U dun say so early hor... U c already then say...        NaN   
4   ham  Nah I don't think he goes to usf, he lives aro...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  
2        NaN        NaN  
3        NaN        NaN  
4        NaN        NaN  
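
The preview shows three `Unnamed` columns that are empty in the first five rows. Five rows cannot tell us whether they are empty throughout, so it may be worth checking the whole frame before dropping them. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in with the same column layout as the preview, not the real file) to show one way to measure missing values per column and count the labels.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for spam.csv: same column names as the preview,
# but the rows are invented for illustration.
toy = pd.DataFrame({
    "v1": ["ham", "ham", "spam", "ham"],
    "v2": ["Ok lar...", "See you at 5", "Free entry, text WIN now", "Call me later"],
    "Unnamed: 2": [np.nan] * 4,
    "Unnamed: 3": [np.nan] * 4,
    "Unnamed: 4": [np.nan] * 4,
})

# Share of missing values in each column (1.0 means the column is entirely empty).
missing_share = toy.isna().mean()
print(missing_share)

# Keep only the label and text columns; .copy() gives an independent frame
# so later column assignments do not trigger chained-assignment warnings.
clean = toy[["v1", "v2"]].copy()
clean.columns = ["label", "message"]

# Count how many messages fall into each class.
label_counts = clean["label"].value_counts()
print(label_counts)
```

Running the same two checks on the real `data` frame would show how empty the extra columns actually are and how unbalanced the two classes are.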


In [None]:
# ==========================================
# Spam Message Detection using Logistic Regression
# Created by Yumna Ehab
# ==========================================

# Step 1: Basic imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Step 2: Load dataset
# (spam.csv was unzipped into /content in the first cell)
data = pd.read_csv("/content/spam.csv", encoding='latin-1')

# Quick look at the data
print("Data shape:", data.shape)
print(data.head(2))

# Step 3: Clean the dataset
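# The 'Unnamed' columns are extra, so we keep only the label and the text.
# .copy() makes an independent frame, which avoids a SettingWithCopyWarning
# when the label column is reassigned below.
data = data[['v1', 'v2']].copy()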
data.columns = ['label', 'message']

# Convert labels to numbers: ham=0, spam=1
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

# Step 4: Split data (80% train, 20% test; random_state fixes the shuffle)
X_train, X_test, y_train, y_test = train_test_split(
    data['message'], data['label'], test_size=0.2, random_state=42
)

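# Step 5: Convert text to numerical form (TF-IDF)
# stop_words='english' drops common English words; max_df=0.8 ignores terms
# that appear in more than 80% of the training messages.
# The vectorizer is fitted on the training set only, then reused on the test set.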
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.8)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Step 6: Train the model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)

# Step 7: Predict and evaluate
y_pred = model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
print("\nModel Accuracy:", round(accuracy * 100, 2), "%")
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Step 8: Try a few sample predictions
sample_messages = [
    "Congratulations! You've won a $1000 Walmart gift card. Click to claim!",
    "Hey, can we meet tomorrow at 10?",
    "URGENT! Your account has been suspended. Verify now."
]

sample_features = vectorizer.transform(sample_messages)
predictions = model.predict(sample_features)

for msg, pred in zip(sample_messages, predictions):
    print(f"\nMessage: {msg}")
    print("Prediction:", "SPAM 🚫" if pred == 1 else "HAM ✅")

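# Step 9: Summary
# Note: accuracy is high partly because ham dominates the test set;
# the spam recall printed in Step 7 is the number to watch.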
print("\nThe model works well for detecting spam messages using text features.")
print("Next step: try Naive Bayes or SVM to compare results.")


Data shape: (5572, 5)
    v1                                                 v2 Unnamed: 2  \
0  ham  Go until jurong point, crazy.. Available only ...        NaN   
1  ham                      Ok lar... Joking wif u oni...        NaN   

  Unnamed: 3 Unnamed: 4  
0        NaN        NaN  
1        NaN        NaN  

Model Accuracy: 95.25 %

Classification Report:
               precision    recall  f1-score   support

           0       0.95      1.00      0.97       965
           1       0.97      0.67      0.79       150

    accuracy                           0.95      1115
   macro avg       0.96      0.83      0.88      1115
weighted avg       0.95      0.95      0.95      1115
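
The report is easier to read once it is turned back into counts. The sketch below is a back-calculation from the rounded figures printed above (accuracy 95.25%, 150 spam messages out of 1115, spam precision 0.97, spam recall 0.67). It is an approximate reconstruction, not the model's actual confusion matrix. It searches for the spam true-positive count that is consistent with all of those rounded values.

```python
# Figures copied from the printed report (precision and recall are rounded there).
n_test = 1115
n_spam = 150
accuracy = 0.9525
spam_precision = 0.97
spam_recall = 0.67

# Total number of misclassified messages implied by the accuracy.
n_errors = round(n_test * (1 - accuracy))


def rounds_to(value, target):
    """True if value would be printed as target when rounded to two decimals."""
    return abs(value - target) < 0.005


# Try every possible true-positive count and keep the consistent ones.
candidates = []
for tp in range(1, n_spam + 1):
    fn = n_spam - tp          # spam messages predicted as ham
    fp = n_errors - fn        # ham messages predicted as spam
    if fp < 0:
        continue
    if rounds_to(tp / n_spam, spam_recall) and rounds_to(tp / (tp + fp), spam_precision):
        candidates.append((tp, fn, fp))

print("errors:", n_errors)
print("(tp, fn, fp) candidates:", candidates)
```

Under these assumptions the errors are mostly missed spam rather than ham wrongly flagged, which is why the overall accuracy looks strong while spam recall stays at 0.67.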


Message: Congratulations! You've won a $1000 Walmart gift card. Click to claim!
Prediction: SPAM 🚫

Message: Hey, can we meet tomorrow at 10?
Prediction: HAM ✅

Message: URGENT! Your account has been suspended. Verify now.
Prediction: HAM ✅
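
The third message reads like a phishing attempt but is labelled HAM. `model.predict` applies a fixed 0.5 cut-off to the spam probability, so one way to investigate such cases is to look at `predict_proba` directly and try a lower cut-off. The sketch below runs on a tiny made-up corpus (`texts`, `labels`, and the 0.3 `threshold` are illustrative assumptions, not values from this notebook), so its probabilities say nothing about the real model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: 1 = spam, 0 = ham.
texts = [
    "Win a free prize now, click to claim",
    "URGENT verify your account now",
    "Free entry to win cash, text now",
    "Claim your free gift card today",
    "Are we still meeting tomorrow?",
    "Ok see you at lunch",
    "Can you call me later tonight",
    "Thanks, I will send the notes tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

vec = TfidfVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(texts), labels)

new_messages = [
    "URGENT! Your account has been suspended. Verify now.",
    "Hey, can we meet tomorrow at 10?",
]

# Column 1 of predict_proba is the probability of class 1 (spam),
# because clf.classes_ is sorted as [0, 1].
spam_prob = clf.predict_proba(vec.transform(new_messages))[:, 1]

threshold = 0.3  # illustrative cut-off, lower than the default 0.5
flagged_default = spam_prob >= 0.5
flagged_lower = spam_prob >= threshold

for msg, p, flag in zip(new_messages, spam_prob, flagged_lower):
    print(f"{p:.2f}  {'SPAM' if flag else 'HAM'}  {msg}")
```

Lowering the cut-off can only flag the same messages or more, so it trades precision for recall. A suitable value would have to be chosen on held-out data rather than on the test set.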

The model works well for detecting spam messages using text fea