# NLP Practice: SMS Spam Classifier

**Simple tutorial for classifiying SMS messages into two categories: "Real" or "Spam".**

The below code employs the Kaggle Dataset, "[Spam SMS Classification Using NLP](https://www.kaggle.com/datasets/mariumfaheem666/spam-sms-classification-using-nlp/data)".

---

## Imports
In this tutorial, we use the below Python packages:
*   string *(comes with most modern Python versions)*
*   pandas  
*   nltk
*   sklearn





In [1]:
# Import Pandas for Data Manipulation
import pandas as pd

# Import String and NLTK for Data Cleaning and Normalization
import string
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")

# Import SKLEARN for train/test split of the data, and...
from sklearn.model_selection import train_test_split

# ...for Text Vectorizer,
from sklearn.feature_extraction.text import CountVectorizer

# ...for Naive Bayes model,
from sklearn.naive_bayes import MultinomialNB

# ...for Accuracy Report,
from sklearn.metrics import classification_report

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True



---



## Fetch "Spam_SMS" Dataset
Download the raw CSV file from the above Kaggle hyperlink. Then, upload that file to your IDE or cloud environment's file directory. The below code will need to read this CSV file into a **Pandas Dataframe**.



```
# imports needed for the below step
import pandas as pd
```



In [3]:
# Read-in the CSV file into a Pandas Dataframe
df = pd.read_csv("/content/Spam_SMS.csv")

In [4]:
# Quickly review the data
df

Unnamed: 0,Class,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...
5570,ham,Will ü b going to esplanade fr home?
5571,ham,"Pity, * was in mood for that. So...any other s..."
5572,ham,The guy did some bitching but I acted like i'd...




---



## Clean the Data
In this step, we want to **clean, normalize, and polish** the messy text data we are given. This includes:
1.   Removing any rows with incomplete values.
2.   Normalizing each character's case (upper or lower).
3.   Removing English Stopwords from the text corpus.





In [5]:
# Drop any information that is NULL or not available, hence "dropna"
df.dropna(inplace=True)

### Cleaning Functions



```
# imports needed for the below step
import string
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")
```



In [6]:
#
# Function to normalize the text corpus to the same lowercase characters.
#

def normalizeText(text):
  return text.lower()

In [8]:
#
# Function to remove stopwords from the text corpus. Also, removes punctuation.
#

# Global Stopwords object
stop_words = set(stopwords.words('english'))

def removeStopwords(text):
  text = normalizeText(text)

  # Removes all punctuation from the text
  text = ''.join([char for char in text if char not in string.punctuation])

  # Removes all stopwords from the text
  text = ' '.join([word for word in text.split() if word not in stop_words])

  return text

In [40]:
# Let's test the cleaning process (using the [x-th] message)
x = 23
print(df.Message[x])
print(removeStopwords(df.Message[x]))

Aft i finish my lunch then i go str down lor. Ard 3 smth lor. U finish ur lunch already?
aft finish lunch go str lor ard 3 smth lor u finish ur lunch already


### Apply the Cleaning Functions

In [11]:
# Clean the entire text corpus.
df["CleanedMessage"] = df.Message.apply(removeStopwords)

In [12]:
# Normalize the ground-truth class. (I.E. Yes or No. 1 or 0.)
df["Class"] = df["Class"].map({"ham": 0, "spam": 1})
df.head()

Unnamed: 0,Class,Message,CleanedMessage
0,0,"Go until jurong point, crazy.. Available only ...",go jurong point crazy available bugis n great ...
1,0,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...
3,0,U dun say so early hor... U c already then say...,u dun say early hor u c already say
4,0,"Nah I don't think he goes to usf, he lives aro...",nah dont think goes usf lives around though




---



## Split and Vectorize the Data
In this step, we want to split the text corpus into two sets. Train and Test. These sets will contain a X and Y component.

X, being the SMS messages, and Y, being the ground-truth class associated with that message.

In [13]:
# Splitting the data into 2 groups: train (75%) and test (25%)

# [imports needed for this step]
# from sklearn.model_selection import train_test_split

# USAGE:
# X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=,random_state=)
#
# WHERE:
#    X: the data we will train on
#    Y: the ground_truth/correct labels for the above data

X_train, X_test, Y_train, Y_test = train_test_split(df["CleanedMessage"],df["Class"],test_size=0.25,random_state=24)

Futhermore, we will apply each of these sets to a Vectorizer. This transforms the text data into machine-readable vectors (matrix of numbers).

Once this process is complete, the Vectorizer will generate a number of "features". We can think of these features as a large spiderweb of contextual connections between the sentences/words our classifer is trained on.

In [14]:
#
# Applying our text to a Vectorizer (maps each word in a given sentence to a numerical format)
#

# [imports needed for this step]
# from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer()

# Fit the training data onto the Vectorizer
X_train = count_vec.fit_transform(X_train)

# Transform the rest of the messages onto the Vectorizer
X_test = count_vec.transform(X_test)

# Collect the unique features discovered by our Vectorizer
feature_names = count_vec.get_feature_names_out()
print(f"Number of features: {len(feature_names)}")

Number of features: 7958




---



## Apply Vectorization, then Predict() using the Naive Bayes Classifier
Now, apply the newly vectorized "X_train" and "X_test" objects to our Naive Bayes classifier.

```
# imports needed for the below step
from sklearn.naive_bayes import MultinomialNB
```

In [15]:
#
# Initialize the base "cookie-cutter" classifier we will use for predicitions.
# This is the Naive Bayes classifier.
#

nb = MultinomialNB()
nb.fit(X_train, Y_train)

Our data is now fitted to the classifier. Let's generate an accuracy!

In [None]:
# Score the classifier's accuracy.
accuracy = nb.score(X_test,Y_test)
print(f'Score: {accuracy}\n')



---





```
# imports needed for the below step
from sklearn.metrics import classification_report
```



Furthermore, we can dig deeper into the accuracy breakdown. This score is made up of the classifier's precision and recall values.

In [41]:
#
# Use our classifier to predict the ground-truth class for the testing entry. This will provide classifier accuracy.
#

# Evaluate the entire test split.
test_pred = nb.predict(X_test)

# Generate a Classification Report
print('Classification Report:')
print(classification_report(test_pred,Y_test))

[0 0 0 ... 0 0 0]
Score: 0.975609756097561

Classification Report: 
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      1239
           1       0.85      0.95      0.90       155

    accuracy                           0.98      1394
   macro avg       0.92      0.96      0.94      1394
weighted avg       0.98      0.98      0.98      1394





---



## Try Your Own Custom SMS!
With the classifier trained and an accuracy score discovered, feel free to test your own custom messages and see if this classifier can detect your deception!

In [52]:
#
# Function that will predict wether or not an input string is "Spam" or "Real".
#

def testMessage(msg):
  # if the Naive Bayes prediction is equal to "array([1])"
  # this means the classifier's prediction was "[1]" (i.e. Spam)
  print('This message is likely', end=" ")

  # check 0th array index for the classifier predictions
  if nb.predict(count_vec.transform([msg]))[0] == 1:
    print('Spam')

  # else, the message is likely Real
  else:
    print('Real')

In [53]:
# Type your custom message here and test the classifier!

customMsg = "free nintendo. enter to win"
testMessage(customMsg)

This message is likely Spam
