
**Week 4 Project: Detect spam/not spam email**

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

pandas → handle the dataset.

train_test_split → split into training and test sets.

CountVectorizer → turn text into numbers (vectors).

MultinomialNB → Naive Bayes model for count features, a common baseline for text classification.

accuracy_score, classification_report → evaluate the model.

In [8]:
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
data = pd.read_csv(url, sep="\t", header=None, names=["label", "message"])

print(data.head())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


label → "ham" (not spam) or "spam".

message → the message text (this dataset contains SMS messages).

pd.read_csv(url, sep="\t", header=None, names=["label", "message"]) → reads the file at url into a pandas DataFrame and stores it in the variable data.

sep="\t" → the values in the file are separated by tabs.

header=None → the file has no header row, so its first line is read as data.

names=["label", "message"] → the column names to assign to the DataFrame.

After this line, data holds one row per message, with two columns named "label" and "message".
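The same read_csv arguments can be tried without a network connection. The sketch below reads two made-up rows from an in-memory string (`sample_tsv` and `sample` are hypothetical names used only here) and then counts the labels, which is a quick way to check how balanced the classes are.

```python
import io

import pandas as pd

# Two made-up rows in the same tab-separated, header-less layout as sms.tsv
sample_tsv = "ham\tSee you at lunch\nspam\tWin cash now\n"

sample = pd.read_csv(
    io.StringIO(sample_tsv),   # file-like object instead of a URL
    sep="\t",                  # tab-separated values
    header=None,               # no header row in the data
    names=["label", "message"],
)

print(sample.shape)                    # (rows, columns)
print(sample["label"].value_counts())  # number of messages per label
```

Running `data["label"].value_counts()` on the real DataFrame gives the same kind of summary for the full dataset.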

In [9]:
X = data["message"]     # text messages
y = data["label"]       # spam or not spam

# Hold out 20% of the messages for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models cannot work on raw text directly, so convert each message into a vector of word counts

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train the model
model = MultinomialNB()
model.fit(X_train_counts, y_train)

# Predict labels for the held-out test messages
y_pred = model.predict(X_test_counts)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Example new messages
new_messages = [
    "Congratulations! You won a free iPhone. Claim your prize now.",
    "Hey, are we still meeting for coffee tomorrow?",
    "Limited time offer!!! Buy 1 get 1 free.",
    "My table is red"
]

# Convert to numbers using the same vectorizer
new_counts = vectorizer.transform(new_messages)

# Predict
predictions = model.predict(new_counts)

# Show results
for msg, pred in zip(new_messages, predictions):
    print(f"Message: {msg}\nPrediction: {pred}\n")


Accuracy: 0.9919282511210762

Classification Report:
               precision    recall  f1-score   support

         ham       0.99      1.00      1.00       966
        spam       1.00      0.94      0.97       149

    accuracy                           0.99      1115
   macro avg       1.00      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115

Message: Congratulations! You won a free iPhone. Claim your prize now.
Prediction: spam

Message: Hey, are we still meeting for coffee tomorrow?
Prediction: ham

Message: Limited time offer!!! Buy 1 get 1 free.
Prediction: ham

Message: My table is red
Prediction: ham
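The third message above was labeled ham even though it reads like spam, which is consistent with the spam recall of 0.94 in the report: some spam slips through. One way to see how confident the model is in each label is `predict_proba`. The sketch below uses a tiny invented training set so it runs on its own; the `toy_*` names are hypothetical, and the probabilities it prints say nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set, only to keep the sketch self-contained
toy_texts = [
    "win a free prize now",
    "free cash offer",
    "are we meeting tomorrow",
    "see you at lunch",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

new_texts = ["free prize", "lunch tomorrow"]
probs = toy_model.predict_proba(toy_vectorizer.transform(new_texts))

# Columns of predict_proba follow the order of toy_model.classes_
for text, row in zip(new_texts, probs):
    scores = {label: round(float(p), 3) for label, p in zip(toy_model.classes_, row)}
    print(text, scores)
```

With the notebook's own objects, the equivalent call would be `model.predict_proba(new_counts)`, read against `model.classes_`. A spam probability close to 0.5 suggests a borderline message rather than a confident ham.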

