### Ideal Steps for Machine Learning:

1. Text Pre-processing and Cleaning
2. Train Test Split
3. BOW and TF-IDF
4. Train the Models
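
As a rough end-to-end sketch of these steps on a tiny made-up dataset (the texts, labels, and variable names are illustrative assumptions, not the SMS data), the split, vectorization, and training can be chained with scikit-learn's `Pipeline`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy, already-cleaned messages (hypothetical data for illustration only)
texts = [
    "free entri win cash", "win free prize call",
    "claim free cash prize", "call win cash",
    "ok see later", "come home dinner",
    "meet lunch tomorrow", "ok call later home",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Step 2: split before vectorizing so the test texts stay unseen
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Steps 3 and 4: the pipeline fits the vectorizer on the training texts only
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("nb", MultinomialNB()),
])
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(list(zip(X_test, preds)))
```

Bundling the vectorizer and the classifier this way means `fit` never sees the test texts, which is the same discipline the later cells apply by hand.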

## Ideally, run this notebook in Google Colab

### Libraries used earlier
- pandas -----> read_csv, get_dummies [OHE]
- nltk -------> sent_tokenize, word_tokenize
- nltk -------> PorterStemmer, WordNetLemmatizer, pos_tag (part-of-speech tagging)

## Libraries used in this project
- pandas -----> read_csv, get_dummies [OHE]
- nltk -------> PorterStemmer, stopwords
- sklearn ----> CountVectorizer [*BagOfWords*]
- sklearn ----> TfidfVectorizer [*TF-IDF*]
- sklearn ----> naive_bayes.MultinomialNB [Classification Model]
- sklearn ----> metrics.classification_report
- sklearn ----> metrics.accuracy_score

In [4]:
import pandas as pd
messages=pd.read_csv('/content/SMSSpamCollection',
                    sep='\t',names=["label","message"])

In [5]:
messages[:5]

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [6]:
## Data Cleaning And Preprocessing
import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:


from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()



In [8]:
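corpus=[]
# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words=set(stopwords.words('english'))
for i in range(0,len(messages)):
    # Keep letters only (A-Z, not A-z: the range A-z would also keep [ \ ] ^ _ `)
    review=re.sub('[^a-zA-Z]',' ',messages['message'][i])
    review=review.lower()
    review=review.split()
    # Drop stop words and stem what remains
    review=[ps.stem(word) for word in review if word not in stop_words]
    review=' '.join(review)
    corpus.append(review)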

In [9]:
# print the first 5 elements of the corpus in separate lines and its respective label
for i in range(5):
    print(corpus[i])
    print(messages['label'][i])

go jurong point crazi avail bugi n great world la e buffet cine got amor wat
ham
ok lar joke wif u oni
ham
free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli
spam
u dun say earli hor u c alreadi say
ham
nah think goe usf live around though
ham
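
The letter-only pattern in the cleaning loop depends on the exact character range. In ASCII, the range `A-z` also covers the six punctuation characters that sit between `Z` and `a` (for example `_` and `^`), whereas `A-Z` together with `a-z` covers letters only. A quick check on a made-up string (the sample text is an assumption for illustration):

```python
import re

# Hypothetical message containing characters that fall between 'Z' and 'a' in ASCII
sample = "Win_cash [now] ^free^ call 0800"

# 'A-z' spans ASCII 65..122, so [ \ ] ^ _ ` survive the substitution
loose = re.sub('[^a-zA-z]', ' ', sample)

# 'A-Z' plus 'a-z' keeps letters only
strict = re.sub('[^a-zA-Z]', ' ', sample)

print(loose.split())
print(strict.split())
```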


## Create the Bag Of Words for Independent Features (Vectorization)

In [10]:
## Create the Bag OF Words model
from sklearn.feature_extraction.text import CountVectorizer
## for Binary BOW enable binary=True
cv=CountVectorizer(max_features=2500,ngram_range=(1,2))

In [11]:
X=cv.fit_transform(corpus).toarray()

In [12]:
feature_names=cv.get_feature_names_out()
print(feature_names)

['aathi' 'abi' 'abiola' ... 'yup ok' 'yup thk' 'zed']


In [13]:
import numpy as np
np.set_printoptions(edgeitems=30, linewidth=100000,
                    formatter=dict(float=lambda x: "%.3g" % x))
X

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0

In [14]:
# Print a sample of cv.vocabulary_ (term -> column index in X)

print(list(cv.vocabulary_.keys())[:10])
print(list(cv.vocabulary_.values())[:10])

['go', 'point', 'crazi', 'avail', 'bugi', 'great', 'world', 'la', 'cine', 'got']
[np.int64(819), np.int64(1629), np.int64(457), np.int64(123), np.int64(234), np.int64(867), np.int64(2420), np.int64(1104), np.int64(356), np.int64(858)]
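
To make the count matrix and the vocabulary mapping concrete, here is a simplified pure-Python sketch of what a unigram-plus-bigram Bag of Words computes. It is an approximation, not scikit-learn's implementation: `CountVectorizer` also applies its own tokenization (by default it drops single-character tokens) and the `max_features` cut-off. The two documents and the helper `ngrams` are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

# Two toy, already-cleaned documents (hypothetical)
docs = ["free entri win", "ok lar ok"]

# Count every unigram and bigram per document
counts = [Counter(ngrams(d.split())) for d in docs]

# Vocabulary: sorted terms, each mapped to a column index
vocab = sorted(set().union(*counts))
vocabulary = {term: idx for idx, term in enumerate(vocab)}

# Count BoW versus binary BoW (the effect of binary=True)
count_matrix = [[c[t] for t in vocab] for c in counts]
binary_matrix = [[int(c[t] > 0) for t in vocab] for c in counts]

print(vocab)
print(count_matrix)
print(binary_matrix)
```

In the second document "ok" occurs twice, so its count column holds 2 while the binary variant records only 1.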


## Dependent feature Vectorization

In [15]:
y=pd.get_dummies(messages['label'])

In [16]:
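# Column 0 is 'ham' (get_dummies sorts columns alphabetically), so True means ham and False means spam
y=y.iloc[:,0].values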

In [17]:
y

array([ True,  True, False,  True,  True, False,  True,  True, False, False,  True, False, False,  True,  True, False,  True,  True,  True, False,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True, ...,  True,  True,  True,  True,  True, False,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True, False, False,  True,  True,  True,  True])
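
`pd.get_dummies` orders its columns alphabetically, so column 0 is `ham` and `True` in `y` means ham. A sketch on a few made-up labels, including a hypothetical alternative (`y_spam`, not used in this notebook) that makes spam the positive class:

```python
import pandas as pd

# Hypothetical labels for illustration
labels = pd.Series(["ham", "spam", "ham", "ham", "spam"], name="label")

# One-hot encode: columns come out in alphabetical order
dummies = pd.get_dummies(labels)
print(list(dummies.columns))

# What this notebook does: take column 0, i.e. the 'ham' indicator
y_ham = dummies.iloc[:, 0].values.astype(bool)

# Alternative encoding with spam as the positive class (1 = spam)
y_spam = (labels == "spam").astype(int).values

print(y_ham)
print(y_spam)
```

With the notebook's encoding, the `False` row of the classification report is the spam class.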

## Split the data set into Training and Test

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [19]:
from sklearn.naive_bayes import MultinomialNB

## Train the Naive Bayes Model for Classification

In [20]:
spam_detect_model=MultinomialNB().fit(X_train, y_train)

## Predict - Test the Model

In [21]:
y_pred=spam_detect_model.predict(X_test)

In [22]:
print(X_test[:5])
print(y_pred[:5])

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[ True  True  True  True  True]


In [23]:
from sklearn.metrics import accuracy_score, classification_report

## Verify the model accuracy


In [24]:
accuracy_score(y_test,y_pred)

0.9847533632286996

## One-line explanations for each metric
- Precision: Of the cases predicted as a class, the fraction that actually belong to it (high precision means few false positives).
- Recall: Of the cases that actually belong to a class, the fraction that were correctly identified (high recall means few false negatives).
- F1-Score: Harmonic mean of precision and recall, a balanced measure when false positives and false negatives matter equally.
- Support: The number of actual occurrences of each class in the evaluated data, indicating how much data was available to evaluate each class.
- Accuracy: The overall fraction of correct predictions out of all predictions made.
- Macro avg: The unweighted average of precision/recall/F1 across all classes, treating each class equally regardless of its frequency.
- Weighted avg: The average of precision/recall/F1 across all classes, weighted by the number of samples (support) in each class.
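
These definitions can be checked by hand. The sketch below computes them in plain Python on made-up labels (the data and the helper `class_metrics` are assumptions for illustration, not the notebook's results):

```python
def class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # correctly predicted as cls
    fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly predicted as cls
    fn = sum(t == cls and p != cls for t, p in pairs)  # missed members of cls
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]

rows = {cls: class_metrics(y_true, y_pred, cls) for cls in (0, 1)}
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Macro: plain mean over classes; weighted: mean weighted by support
macro_precision = sum(r[0] for r in rows.values()) / len(rows)
weighted_precision = sum(r[0] * r[3] for r in rows.values()) / len(y_true)

for cls, (p, r, f1, support) in rows.items():
    print(cls, round(p, 2), round(r, 2), round(f1, 2), support)
print("accuracy", accuracy)
print("macro precision", round(macro_precision, 2))
print("weighted precision", round(weighted_precision, 2))
```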

In [25]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       False       0.96      0.92      0.94       143
        True       0.99      0.99      0.99       972

    accuracy                           0.98      1115
   macro avg       0.98      0.96      0.97      1115
weighted avg       0.98      0.98      0.98      1115




## Vectorization using TFIDF

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(corpus, y, test_size=0.20)

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=2500,ngram_range=(1,2))

In [29]:
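# Fit on the full corpus only to inspect the TF-IDF matrix; for training, fit on the training split alone
X = tv.fit_transform(corpus).toarray()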

In [30]:
X[:2]

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

In [31]:
# Inspect the TF-IDF vectorizer's vocabulary (tv, not the earlier cv)
print(list(tv.vocabulary_.keys())[:10])
print(list(tv.vocabulary_.values())[:10])

['go', 'point', 'crazi', 'avail', 'bugi', 'great', 'world', 'la', 'cine', 'got']
[np.int64(819), np.int64(1629), np.int64(457), np.int64(123), np.int64(234), np.int64(867), np.int64(2420), np.int64(1104), np.int64(356), np.int64(858)]
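
For intuition about the weights in the TF-IDF matrix, the sketch below reproduces by hand the weighting that, to my understanding of the scikit-learn documentation, `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document vector. The toy documents and the helper `tfidf_vector` are assumptions for illustration.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical)
docs = [["free", "win", "free"], ["ok", "win"], ["ok", "lar"]]
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

def tfidf_vector(doc):
    """TF-IDF weights for one document, scaled to unit L2 length."""
    tf = Counter(doc)
    raw = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = [tfidf_vector(doc) for doc in docs]
for doc, w in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

Terms that appear in fewer documents (here "free" and "lar") receive a higher IDF than terms shared across documents ("win", "ok").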


## Model training

In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(corpus, y, test_size=0.20)

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=2500,ngram_range=(1,2))

In [50]:
X_train = tv.fit_transform(X_train).toarray()
X_test = tv.transform(X_test).toarray()

We use fit_transform() on the training data and only transform() on the test data to prevent data leakage and ensure a fair evaluation.

fit_transform() on X_train:

- Learns the vocabulary and parameters from the training data (fitting)
- Applies the transformation using those learned parameters
- The vectorizer learns which words exist, their document frequencies, IDF values, etc.

transform() on X_test:

- Only applies the transformation using parameters already learned from the training data
- Does not learn new vocabulary or update any statistics
- Uses the same feature space as the training data

Why this matters:

- Prevents data leakage: the test data does not influence the vectorizer's learned parameters
- Maintains consistency: both datasets use identical feature mappings
- Realistic evaluation: simulates the real-world scenario where new data gets the same preprocessing as the training data
- Same feature dimensions: ensures the model can make predictions on the test set

If the vectorizer were fit on data that includes the test set, the test set would influence the vocabulary and IDF statistics, making the evaluation optimistic and less representative of real-world performance. Calling fit_transform() separately on the test set is worse still: it builds a different vocabulary, so the columns no longer match the features the model was trained on.
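
A small sketch of this behavior on made-up texts (the documents and variable names are assumptions for illustration): the vectorizer learns its vocabulary from the training texts, and words that appear only in the test texts are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-cleaned texts
train_docs = ["free entri win", "ok lar joke", "win free cup"]
test_docs = ["free prize claim", "ok see later"]

tv_demo = TfidfVectorizer()

# Learn vocabulary and IDF from the training texts, then transform them
X_train_demo = tv_demo.fit_transform(train_docs)

# Reuse the learned vocabulary and IDF; nothing is refit here
X_test_demo = tv_demo.transform(test_docs)

print(sorted(tv_demo.vocabulary_))
print(X_train_demo.shape, X_test_demo.shape)
print("prize" in tv_demo.vocabulary_)
```

Both matrices have the same number of columns, and each test row has a single non-zero entry because only one of its words ("free" or "ok") was seen during fitting.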

In [51]:
from sklearn.naive_bayes import MultinomialNB
spam_tfidf_model=MultinomialNB().fit(X_train, y_train)

## Predict

In [52]:
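y_pred=spam_tfidf_model.predict(X_test)  # store the predictions; the metric cells below read y_pred
y_pred

array([ True,  True, False,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True, False,  True,  True,  True,  True, False,  True,  True,  True,  True, ...,  True, False,  True,  True,  True, False,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True,  True, False,  True,  True,  True,  True,  True,  True,  True])

In [53]:
from sklearn.metrics import accuracy_score, classification_report

## Accuracy Score

Note: the outputs recorded below (accuracy 0.764 and the report that follows) do not measure the TF-IDF model. They were computed while y_pred still held the Bag of Words predictions from In [21], which belong to a different random split than the current y_test. Rerun the cells to obtain the TF-IDF model's actual scores.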

In [54]:
score = accuracy_score(y_test,y_pred)
print(score)

0.7641255605381166


In [55]:
report = classification_report(y_test,y_pred)
print(report)

              precision    recall  f1-score   support

       False       0.12      0.10      0.11       159
        True       0.85      0.87      0.86       956

    accuracy                           0.76      1115
   macro avg       0.49      0.49      0.49      1115
weighted avg       0.75      0.76      0.76      1115

