SMS/Email Spam Classification WebApp

View Demo · Report Bug · Request Feature · Ask Question

Overview

This analysis classifies SMS and email messages as spam or not spam using machine learning techniques. The dataset is the SMS Spam Collection dataset from Kaggle, which contains 5,574 SMS messages, each labeled as spam or not spam.

The first step in the project was to preprocess the data by removing duplicate rows, converting the target column to binary values, and performing natural language processing techniques on the message text. The model building process began by trying different machine learning algorithms, starting with Naive Bayes. To make the text data suitable for the model, the text was converted to numerical data using vectorization techniques such as Bag of Words and TF-IDF.

The evaluation metrics used were accuracy, precision, recall, and F1 score. The main focus was on the precision score, as the goal of the model is to minimize the number of false positive predictions (i.e. not spam messages classified as spam). The best results were obtained using the MultinomialNB algorithm, which achieved 95.9% accuracy and 100% precision score.

In conclusion, the MultinomialNB model performed best at classifying SMS and email messages as spam or not spam, achieving a precision score of 100%. This suggests that MultinomialNB is an effective method for this task when the priority is minimizing the number of false positive predictions.

Project Files/Requirements

assets - assets folder
app.py - main Python file
prediction.py - for new predictions
analysis.py - for the analysis charts page
requirements.txt - list of all required packages
Procfile - required for Heroku deployment
model.pkl - trained model for prediction
nltk.txt -
spam.csv - original dataset
transform_text.pkl - transformed text file
transformed_df.csv - cleaned and transformed dataset

Analysis Report

Data Exploration

Upon initial exploration of the dataset, it was found that the dataset contains 5,574 rows and 2 columns. The columns in the dataset are labeled as "v1" and "v2", where "v1" represents the target variable indicating whether the message is spam or not, and "v2" represents the message itself. Upon further exploration, it was found that the dataset is imbalanced, with more not spam messages (ham) than spam messages.

Data Cleaning

In order to prepare the data for analysis, several data cleaning tasks were performed. Firstly, duplicate rows were removed from the dataset to ensure that each message in the dataset is unique. Secondly, the target column was converted to binary values, where 1 represents a spam message and 0 represents a not spam message.
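The two cleaning steps above can be sketched with pandas as follows. This is a minimal illustration on a made-up four-row table rather than the project's actual code; the "v1" and "v2" column names come from the dataset description, while the "target" column name is an assumption.

```python
import pandas as pd

# Toy stand-in for spam.csv: "v1" is the label, "v2" is the message text.
df = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "spam"],
    "v2": ["See you at 5", "WIN a free prize now", "See you at 5", "Claim your reward"],
})

# Remove exact duplicate rows, keeping the first occurrence of each.
df = df.drop_duplicates(keep="first").reset_index(drop=True)

# Convert the text labels to binary values: 1 for spam, 0 for not spam (ham).
# The "target" column name is hypothetical.
df["target"] = df["v1"].map({"ham": 0, "spam": 1})

print(df[["target", "v2"]])
print(df["target"].value_counts())
```

Calling value_counts on the binary column is also a quick way to see the class imbalance noted during data exploration.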

Exploratory Data Analysis

Exploratory Data Analysis (EDA) was performed on the cleaned dataset to better understand the characteristics of the spam and not spam messages. Three new features were created to represent the number of characters, words, and sentences in each message. The distributions of these new features were then visualized using histograms and pie charts. The EDA suggested that spam messages tend to have more characters, words, and sentences than not spam messages.
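As a rough sketch of the three engineered features, the function below counts characters, words, and sentences for a single message. It uses simple regular expressions as a stand-in for a proper tokenizer, so the counts may differ from what the project computes; the function name and the feature names are assumptions.

```python
import re

def message_stats(text: str) -> dict:
    """Count characters, words, and sentences in one message (regex approximation)."""
    words = re.findall(r"\w+", text)
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "num_characters": len(text),
        "num_words": len(words),
        "num_sentences": len(sentences),
    }

# Made-up example messages, not taken from the dataset.
ham = message_stats("Ok. See you soon")
spam = message_stats("URGENT! You have won a prize. Call now to claim!")
print(ham)
print(spam)
```

Applying a function like this to every message gives the per-class distributions that the histograms compare.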

Model Report

This report details the process of training and evaluating a machine learning model for classifying SMS and email messages as spam or not spam. The dataset used for this analysis is the SMS Spam Collection dataset from the UCI Machine Learning Repository, which contains 5,574 SMS messages and labels indicating whether each message is spam or not.

Data Preprocessing

The first step in the model building process was to preprocess the data. This included removing duplicate rows, converting the target column to binary values (1 for spam, 0 for not spam), and performing natural language processing techniques on the message text, such as tokenization, removal of special characters, removal of stop words, and stemming.
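The text steps listed above (tokenization, removal of special characters, removal of stop words, and stemming) might look roughly like the sketch below. It is dependency-free on purpose: the tiny stop-word set and the suffix-stripping stemmer are hypothetical stand-ins for a full stop-word list and a real stemmer, and the function name transform_text is only a guess based on the transform_text.pkl file name.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "you", "your", "and", "of", "for", "in"}

def crude_stem(word: str) -> str:
    """Strip a few common suffixes; a rough stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def transform_text(text: str) -> str:
    # Lowercase, then keep only alphanumeric runs: this tokenizes the text
    # and drops special characters in one step.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stem what is left and join back into a single string for vectorization.
    return " ".join(crude_stem(t) for t in tokens)

cleaned = transform_text("You are WINNING free prizes!!! Call now to claim.")
print(cleaned)  # winn free priz call now claim
```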

Model Building

The model building process began by trying different machine learning algorithms, starting with Naive Bayes. Naive Bayes is known to perform well on textual data, so it was selected as the first algorithm to try.

To make the text data suitable for the model, the text was converted to numerical data using vectorization techniques such as Bag of Words and TF-IDF. Among the first models tried, the BernoulliNB algorithm achieved 97% accuracy and a 97% precision score. However, this model still classified some not spam messages as spam, which is not desirable.
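To make the vectorization step concrete, here is a small sketch that trains a MultinomialNB classifier on Bag of Words and on TF-IDF features using scikit-learn. The six training messages are invented for illustration, the vectorizers use default settings, and none of this is claimed to match the project's actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy messages, only for illustration (1 = spam, 0 = not spam).
texts = [
    "win a free prize now",
    "free cash call now",
    "claim your free reward today",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after the meeting",
]
labels = [1, 1, 1, 0, 0, 0]
new_messages = ["free prize waiting call now", "are we still meeting tonight"]

predictions = {}
for name, vectorizer in [("bag_of_words", CountVectorizer()), ("tfidf", TfidfVectorizer())]:
    # Each pipeline turns raw text into a numeric matrix, then fits Naive Bayes on it.
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(texts, labels)
    predictions[name] = model.predict(new_messages).tolist()

print(predictions)
```

Swapping MultinomialNB for BernoulliNB in the pipeline is enough to compare the two algorithms on the same features.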

Model Evaluation

The evaluation metrics used were accuracy, precision, recall, and F1 score. The main focus was on the precision score, as the goal of the model is to minimize the number of false positive predictions (i.e. not spam messages classified as spam). The BernoulliNB model had a higher accuracy of 97%, but accuracy was not the deciding factor here because precision was the priority.

The best results were obtained using the MultinomialNB algorithm, which achieved 95.9% accuracy and 100% precision score. This model correctly identified all not spam messages, minimizing the number of false positive predictions.
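The four metrics can be computed directly from the counts of true and false positives and negatives. The sketch below uses invented labels, not the project's results, to show how a model can reach 100% precision (no not spam message flagged as spam) while its accuracy stays below 100% because some spam is missed.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = spam)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not spam flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not spam passed through
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: one spam message is missed, but nothing is wrongly flagged.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```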

Conclusion

In conclusion, the trained MultinomialNB model performed best at classifying SMS and email messages as spam or not spam, achieving a precision score of 100%. This suggests that MultinomialNB is an effective method for this task when the priority is minimizing the number of false positive predictions. The BernoulliNB model had a higher accuracy, but accuracy was not the deciding factor here because precision was the priority.

