An end-to-end project that classifies SMS and email messages as spam or not spam using machine learning techniques.

santos-k/SMS-SPAM-Classifier

SMS/Email Spam Classification WebApp

View Demo · Report Bug · Request Feature · Ask Question

Overview

This analysis classifies SMS and email messages as spam or not spam using machine learning techniques. The dataset is the SMS Spam Collection, obtained from Kaggle (it originates from the UCI Machine Learning Repository), which contains 5,574 SMS messages, each labeled as spam or not spam.

The first step in the project was to preprocess the data by removing duplicate rows, converting the target column to binary values, and applying natural language processing techniques to the message text. To make the text suitable for a model, it was converted to numerical data using vectorization techniques such as Bag of Words and TF-IDF. Model building then began by trying different machine learning algorithms, starting with Naive Bayes.

The evaluation metrics used were accuracy, precision, recall, and F1 score. The main focus was on precision, because the goal of the model is to minimize the number of false positives (i.e. not spam messages classified as spam). The best results were obtained using the MultinomialNB algorithm, which achieved 95.9% accuracy and 100% precision.

In conclusion, the MultinomialNB model was the preferred classifier: with a precision of 100%, none of the messages it flagged as spam in the evaluation were actually not spam. This suggests that MultinomialNB is an effective method for classifying SMS and email messages while keeping false positives to a minimum.

Project Files/Requirements

  1. assests - static assets folder
  2. app.py - main Python file
  3. prediction.py - predictions for new messages
  4. analysis.py - analysis charts page
  5. requirements.txt - list of required packages
  6. Procfile - required for Heroku deployment
  7. model.pkl - trained model used for prediction
  8. nltk.txt - NLTK data packages to download at deploy time (read by Heroku's Python buildpack)
  9. spam.csv - original dataset
  10. tranform_text.pkl - transformed text file
  11. transform_df.csv - cleaned and transformed dataset

Analysis Report

Data Exploration

Upon initial exploration of the dataset, it was found that the dataset contains 5,574 rows and 2 columns. The columns in the dataset are labeled as "v1" and "v2", where "v1" represents the target variable indicating whether the message is spam or not, and "v2" represents the message itself. Upon further exploration, it was found that the dataset is imbalanced, with more not spam messages (ham) than spam messages.

Data Cleaning

In order to prepare the data for analysis, several data cleaning tasks were performed. Firstly, duplicate rows were removed from the dataset to ensure that each message in the dataset is unique. Secondly, the target column was converted to binary values, where 1 represents a spam message and 0 represents a not spam message.
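The two cleaning steps above can be sketched in plain Python. This is a minimal illustration, not the project's code: the toy rows are made up, the raw label values "ham" and "spam" are assumed from the dataset description, and `clean` is a hypothetical helper name.

```python
# Toy rows in the (v1, v2) layout described above; the messages are made up.
rows = [
    ("ham", "See you at lunch?"),
    ("spam", "WIN a free prize now!!!"),
    ("ham", "See you at lunch?"),  # exact duplicate of the first row
    ("spam", "Claim your reward today"),
]


def clean(rows):
    """Drop exact duplicate rows (keeping the first) and map labels to 0/1."""
    label_to_int = {"ham": 0, "spam": 1}  # assumed raw label values
    seen = set()
    cleaned = []
    for label, text in rows:
        if (label, text) in seen:
            continue  # skip rows already encountered
        seen.add((label, text))
        cleaned.append((label_to_int[label], text))
    return cleaned


cleaned = clean(rows)
```

With a DataFrame library the same steps would typically be a duplicate-drop followed by a label mapping; the plain-Python version is used here only to keep the sketch dependency-free.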

Exploratory Data Analysis

Exploratory Data Analysis (EDA) was performed on the cleaned dataset to better understand the characteristics of the spam and not spam messages. Three new features were created to represent the number of characters, words, and sentences in each message. The distributions of these new features were then visualized using histograms and pie charts. The EDA suggested that spam messages tend to have more characters, words, and sentences than not spam messages.
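As a rough illustration of the three engineered features, the sketch below counts characters, words, and sentences for one message. The function name `message_features` is hypothetical, and the regex-based sentence split is an assumption standing in for whatever tokenizer the project actually uses.

```python
import re


def message_features(text):
    """Return (num_characters, num_words, num_sentences) for one message."""
    num_characters = len(text)
    num_words = len(text.split())  # whitespace-separated words
    # Treat each run of '.', '!' or '?' as one sentence boundary; a rough
    # stand-in for a proper sentence tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return num_characters, num_words, len(sentences)


features = message_features("Free entry! Call now.")
```

Computing these per message and grouping by label is one way to compare the spam and not spam distributions described above.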

Model Report

This report details the process of training and evaluating a machine learning model for classifying SMS and email messages as spam or not spam. The dataset is the SMS Spam Collection from the UCI Machine Learning Repository (also distributed on Kaggle), which contains 5,574 SMS messages, each labeled as spam or not spam.

Data Preprocessing

The first step in the model building process was to preprocess the data. This included removing duplicate rows, converting the target column to binary values (1 for spam, 0 for not spam), and applying natural language processing techniques to the message text: tokenization, removal of special characters, removal of stop words, and stemming.
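The text steps listed above could look roughly like the following. Everything here is an assumption made for illustration: `transform_text` and `naive_stem` are hypothetical names, the stop-word list is deliberately tiny, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "your", "and", "of", "in"}


def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform_text(text):
    """Lowercase, tokenize, drop special characters and stop words, then stem."""
    # Keeping only alphanumeric runs both tokenizes and removes punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)


example = transform_text("You are WINNING the prizes!!!")
```

The output is a space-joined string of stems, which is a convenient input format for the vectorization step that follows.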

Model Building

The model building process began by trying different machine learning algorithms, starting with Naive Bayes. Naive Bayes is known to perform well on textual data, so it was selected as the first algorithm to try.

To make the text data suitable for the model, the text was converted to numerical data using vectorization techniques such as Bag of Words and TF-IDF. The highest accuracy was obtained with the BernoulliNB algorithm, which achieved 97% accuracy and 97% precision. However, this model still classified some not spam messages as spam, which is not desirable.
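To make the two vectorization ideas concrete, here is a from-scratch sketch, assuming whitespace-tokenized input. It is not the project's code: the function names and toy documents are hypothetical, and the TF-IDF weighting uses the unsmoothed textbook form idf = ln(N / df).

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, n in Counter(doc.split()).items():
            vec[index[tok]] = n  # raw count of the token in this document
        vectors.append(vec)
    return vocab, vectors


def tf_idf(vectors):
    """Reweight counts by idf = ln(N / df): the textbook form, unsmoothed."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    # Document frequency: number of documents containing each term.
    df = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    return [[v[j] * math.log(n_docs / df[j]) for j in range(n_terms)] for v in vectors]


docs = ["free prize now", "call me now", "free free call"]  # made-up messages
vocab, counts = bag_of_words(docs)
weights = tf_idf(counts)
```

Library implementations differ in detail; scikit-learn's `TfidfVectorizer`, for instance, smooths the idf term and L2-normalizes each row by default, so its numbers will not match this sketch exactly.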

Model Evaluation

The evaluation metrics used were accuracy, precision, recall, and F1 score. The main focus was on precision, because the goal of the model is to minimize the number of false positives (i.e. not spam messages classified as spam). The BernoulliNB model had the higher accuracy (97%), but accuracy matters less than precision in this case, and at 97% precision it still flagged some not spam messages as spam.
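The four metrics follow directly from the confusion-matrix counts. The sketch below uses made-up labels (1 = spam, 0 = not spam) and a hypothetical helper name; it also shows how a classifier can reach 100% precision while its accuracy stays lower, because it misses some spam.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 with spam (1) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not spam flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that was missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: one spam message is missed, no not spam message is flagged.
metrics = classification_metrics(
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
)
```

Here precision is 1.0 because there are no false positives, while the missed spam message lowers recall, F1, and accuracy.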

The best results were obtained using the MultinomialNB algorithm, which achieved 95.9% accuracy and 100% precision. In other words, none of the messages this model flagged as spam were actually not spam, so it produced no false positives in the evaluation.
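For intuition about how a multinomial Naive Bayes classifier scores a message, here is a from-scratch sketch with Laplace smoothing. It is an illustration only: the names MultinomialNB and BernoulliNB above match scikit-learn's classes, but this code does not use that library, and the function names, the `alpha` parameter, and the training messages are all hypothetical.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed token log-likelihoods."""
    vocab = {tok for doc in docs for tok in doc.split()}
    model = {"vocab": vocab, "log_prior": {}, "log_lik": {}}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(tok for d in class_docs for tok in d.split())
        # Smoothed denominator: class token total plus alpha per vocabulary term.
        total = sum(counts.values()) + alpha * len(vocab)
        model["log_prior"][c] = math.log(len(class_docs) / len(docs))
        model["log_lik"][c] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return model


def predict(model, doc):
    """Return the class with the highest log-score; unseen tokens are skipped."""
    scores = {}
    for c, prior in model["log_prior"].items():
        likelihood = sum(
            model["log_lik"][c][t] for t in doc.split() if t in model["vocab"]
        )
        scores[c] = prior + likelihood
    return max(scores, key=scores.get)


# Made-up training messages: 1 = spam, 0 = not spam.
train_docs = ["free prize now", "win free cash", "see you at lunch", "call me at home"]
train_labels = [1, 1, 0, 0]
model = train_multinomial_nb(train_docs, train_labels)
```

Calling `predict(model, "free cash prize")` on this toy model returns the spam class, since those tokens occur only in the spam training messages.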

Conclusion

In conclusion, the trained MultinomialNB model was the preferred classifier: with a precision of 100%, none of the messages it flagged as spam in the evaluation were actually not spam. This suggests that MultinomialNB is an effective method for classifying SMS and email messages while keeping false positives to a minimum. The BernoulliNB model had a higher accuracy (97% versus 95.9%), but accuracy matters less than precision in this case, and its precision was lower at 97%.

Snapshots

[8 screenshots; images not reproduced here]