# Spam Detection in Social Media Using Supervised Machine Learning

Social media spam, ranging from malicious content to unsolicited promotion, plagues every social network, but it is a problem that machine learning can help solve. In this project, three supervised learning approaches (Naïve Bayes, Support Vector Machine, and Random Forest) are implemented, analyzed, and compared as ways to combat spam on Twitter. The problem is widespread: reports from March 2017 put the number of bots on Twitter at nearly 48 million out of 319 million monthly active users. The models will be trained on a labeled dataset of tweets and user data collected with Tweepy. The features analyzed are the tweet text and the user's follower and following counts. The labeled dataset will include a variety of spam messages and styles, since techniques vary from spammer to spammer.


# The Data:
I began this project with a labeled dataset of text messages, found [here](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

The dataset labels spam messages as "spam" and non-spam messages as "ham". These labels were converted to 1 for "spam" and 0 for "ham".

In [None]:
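import pandas as pd

# The file is tab-separated with no header row: the label comes first, then the message text.
# Replace the placeholder path with the location of the downloaded file.
df = pd.read_table("/filelocation/SMSSpamCollection", header=None, names=['label', 'sms_message'])
# Encode the labels numerically: ham -> 0, spam -> 1.
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()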

# Data Splits
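Before any of the algorithms can be trained, the data must be split into a training set and a testing set. The messages are then converted into word-count vectors with `CountVectorizer`, which is fitted on the training set only.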


In [None]:
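from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Default split: 75% training, 25% testing; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], df['label'], random_state=1)

# Learn the vocabulary from the training messages only, then reuse it for the test set.
count_vector = CountVectorizer()
training_data = count_vector.fit_transform(X_train)
testing_data = count_vector.transform(X_test)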
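
To make the vectorization step concrete, here is a minimal pure-Python sketch of the bag-of-words idea behind `CountVectorizer`: build a vocabulary from the training messages only, then count how often each vocabulary word occurs in a message. The toy messages and the helper names `build_vocabulary` and `to_counts` are hypothetical, and the tokenizer is a plain lowercase whitespace split, which is simpler than scikit-learn's default (that one, for instance, drops single-character tokens).

```python
from collections import Counter

def build_vocabulary(messages):
    """Map each distinct lowercase token in the training messages to a column index."""
    tokens = sorted({tok for msg in messages for tok in msg.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_counts(message, vocabulary):
    """Turn one message into a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(message.lower().split())
    return [counts.get(tok, 0) for tok in vocabulary]

# Hypothetical toy messages, not taken from the SMS dataset.
train_messages = ["win a free prize now", "are you free for lunch"]
test_message = "free free prize tomorrow"

vocabulary = build_vocabulary(train_messages)                  # analogous to fit
train_matrix = [to_counts(m, vocabulary) for m in train_messages]
test_vector = to_counts(test_message, vocabulary)              # analogous to transform

print(list(vocabulary))
print(test_vector)
```

The word "tomorrow" never appears in the training messages, so it has no column and is silently dropped from the test vector. This is the same reason the notebook calls `fit_transform` on the training set and only `transform` on the testing set.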

# Analysis
All the algorithms used in this project will be evaluated on the following metrics:
#### Accuracy:
The measurement of how often the correct prediction is made. It is the ratio of correct predictions to the total number of predictions. 
#### Precision:
The proportion of messages classified as spam that were actually spam. The ratio of True Positives/(True Positives + False Positives)
#### Recall:
The proportion of messages that were actually spam and were correctly labeled as spam. The ratio of True Positives/(True Positives + False Negatives)
#### F1 Score: 
The F1 score is the harmonic mean of the precision and recall, where the best value is 1 and the worst is 0.
F1 = 2 * (precision * recall) / (precision + recall)
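
The four definitions above can be sketched directly from confusion-matrix counts. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration; the sketch also assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of everything flagged as spam, how much was spam
    recall = tp / (tp + fn)      # of all actual spam, how much was flagged
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 40 spam caught, 10 ham wrongly flagged,
# 5 spam missed, 45 ham correctly passed.
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=5, tn=45)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In the notebook itself these values come from scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score`, which compute the same ratios for the positive class.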


# Naive Bayes
The first paradigm I will be testing is Naive Bayes. Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features.

In [None]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data,y_train)
nbpredictions = naive_bayes.predict(testing_data)
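
To show what the "naive" independence assumption means in practice, here is a small from-scratch sketch of a multinomial Naive Bayes classifier. It is an illustration under stated assumptions, not the internals of `MultinomialNB`: the toy training set and the function names `train` and `predict` are hypothetical, tokenization is a whitespace split, and Laplace smoothing with `alpha=1.0` is used so that unseen word/class pairs do not get zero probability.

```python
import math
from collections import Counter

# Hypothetical toy training set: (message, label) with 1 = spam, 0 = ham.
training = [
    ("win free prize", 1),
    ("free cash now", 1),
    ("lunch at noon", 0),
    ("see you at lunch", 0),
]

def train(examples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    vocab = {tok for text, _ in examples for tok in text.split()}
    model = {}
    for label in {lab for _, lab in examples}:
        docs = [text for text, lab in examples if lab == label]
        counts = Counter(tok for text in docs for tok in text.split())
        total = sum(counts.values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(len(docs) / len(examples)),
            "log_likelihood": {tok: math.log((counts[tok] + alpha) / total) for tok in vocab},
        }
    return model

def predict(model, text):
    """Pick the class with the highest log posterior; words outside the vocabulary are skipped."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in text.split():
            if tok in params["log_likelihood"]:
                # Independence assumption: each word adds its own term to the score.
                score += params["log_likelihood"][tok]
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training)
print(predict(model, "free prize now"))
print(predict(model, "lunch at noon"))
```

Working in log space turns the product of per-word probabilities into a sum, which avoids numerical underflow on long messages.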


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Naive Bayes Evaluation

In [None]:
print('Accuracy score: ', format(accuracy_score(y_test,nbpredictions)))
print('Precision score: ', format(precision_score(y_test,nbpredictions)))
print('Recall score: ', format(recall_score(y_test,nbpredictions)))
print('F1 score: ', format(f1_score(y_test,nbpredictions)))

Accuracy score:  0.9834888729361091
Precision score:  0.9555555555555556
Recall score:  0.9197860962566845 
F1 score:  0.9373297002724795 

# Support Vector Machine 
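A Support Vector Machine classifies by finding a decision plane (a hyperplane) that separates the classes with the largest possible margin. The side of the plane on which a message falls determines its label.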


In [None]:
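from sklearn import svm

# A large C penalizes misclassified training points heavily.
# probability=True enables predict_proba at extra training cost; only predict is used here.
clf = svm.SVC(probability=True, C=1000)
clf.fit(training_data, y_train)
svmpredictions = clf.predict(testing_data)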
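
The idea of a decision plane can be illustrated without training anything. The sketch below uses hand-picked, hypothetical `weights` and `bias` for a two-dimensional linear boundary and classifies points by the sign of `w.x + b`. A trained SVM would learn these parameters from the data, and `svm.SVC` uses an RBF kernel by default, so its boundary is generally not a flat plane in the original feature space.

```python
import math

# Hypothetical hand-picked hyperplane parameters; a trained SVM would learn these.
weights = [1.0, -1.0]
bias = 0.0

def decision_value(point):
    """Signed score w.x + b; its sign says which side of the plane the point is on."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def classify(point):
    """Label 1 on the positive side of the plane, 0 otherwise."""
    return 1 if decision_value(point) > 0 else 0

def distance_to_plane(point):
    """Geometric distance from the point to the decision plane."""
    return abs(decision_value(point)) / math.sqrt(sum(w * w for w in weights))

for point in [(3.0, 1.0), (1.0, 3.0)]:
    print(point, classify(point), round(distance_to_plane(point), 3))
```

The margin that an SVM maximizes is this same geometric distance, measured for the training points closest to the plane.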

# SVM Evaluation 

In [None]:
print('Accuracy score: ', format(accuracy_score(y_test, svmpredictions)))
print('Precision score: ', format(precision_score(y_test, svmpredictions)))
print('Recall score: ', format(recall_score(y_test, svmpredictions)))
print('F1 score: ', format(f1_score(y_test, svmpredictions)))

Accuracy score:  0.9870782483847811
Precision score:  0.9941520467836257
Recall score:  0.9090909090909091
F1 score:  0.9497206703910615
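
As a quick consistency check, the F1 scores printed for both models can be recomputed from the printed precision and recall using the formula from the Analysis section. The numbers below are copied from the evaluation outputs in this notebook; the helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, F1) exactly as printed in the evaluation outputs above.
reported = {
    "Naive Bayes": (0.9555555555555556, 0.9197860962566845, 0.9373297002724795),
    "SVM": (0.9941520467836257, 0.9090909090909091, 0.9497206703910615),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_from(precision, recall)
    # True means the printed F1 agrees with the printed precision and recall.
    print(name, recomputed, abs(recomputed - f1) < 1e-9)
```

On these figures the SVM is the more precise of the two models, while Naive Bayes has slightly higher recall.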