<a href="https://colab.research.google.com/github/richard-ky/spam-comments/blob/main/fake_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Spam Comment Classification using Machine Learning in Python</h1>

First, import the vectorizers, which turn the text of our training and test data into numeric feature vectors.<br>Also, import the train/test split function, the classifier that we will use for our model, and the metrics module so that we can analyze our results.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

Next, we need to import our data. This spam comment dataset comes from the comment sections of various popular YouTube videos.<br>We will use `pandas` to work with the data.

In [None]:
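import pandas as pd

# NOTE: the URL below is a time-limited signed link (X-Goog-Date=20221113, X-Goog-Expires=259200 seconds, i.e. 3 days).
# If it no longer resolves, download Youtube01-Psy.csv yourself and pass its local path to pd.read_csv instead.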
df = pd.read_csv('https://storage.googleapis.com/kagglesdsdata/datasets/141926/333383/Youtube01-Psy.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20221113%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20221113T035102Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=960d8321a988c07ac23a190e9a2346d0a403ee5dae523c22e656b04b62ca7ae88afcba564e83aee219c8532c76514d06521ce9401cbc50c60d0ee52c2ef9668565a7e40372b679e72766b8183d449fbdbb1d62e9d8f7af8cff4d77bafb79db2961b7ba77565e3534df63ef04c2f58e5877785acaab830247c15b57d20216d34083a719c335f6c0b13cee075d53036d8107f2a2cf6bf003b02ecfd4d4d8a2ad123358b31a6f58dcc7d1926efce7e794b2023bae4af854b8c36ec48a9a2a7373369cd5189fda53ef080940a1ebd17ead81f7bbdb459ac8f4e2508ec08be084fa6178a0cc060069d4d2e2a4e9da614a25e383971f6b2627c27c4be524cdff6e1aec')

We can take a cursory look at the data.

In [None]:
df[['CONTENT', 'CLASS']].head(3)

Unnamed: 0,CONTENT,CLASS
0,"Huh, anyway check out this you[tube] channel: ...",1
1,Hey guys check out my new channel and our firs...,1
2,just for test I have to say murdev.com,1


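Now, split the data into training and test datasets, holding out 20% of the comments for testing. No `random_state` is set, so the split (and every score below) will vary from run to run.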

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.CONTENT, df.CLASS, test_size=.2)
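
Because `train_test_split` shuffles the data randomly, repeated runs give different splits. As a minimal sketch on made-up data (the `comments` and `labels` lists are hypothetical, not the YouTube dataset), the two optional arguments `random_state` and `stratify` make the split reproducible and keep the class ratio the same in both parts:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 comments, half spam (1) and half legitimate (0).
comments = [f"comment {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

def split(seed):
    # stratify=labels keeps the 50/50 class ratio in both the train and test parts.
    return train_test_split(comments, labels, test_size=0.2,
                            random_state=seed, stratify=labels)

X_train_a, X_test_a, y_train_a, y_test_a = split(seed=0)
X_train_b, X_test_b, y_train_b, y_test_b = split(seed=0)

print(X_test_a == X_test_b)          # the same seed reproduces the same split
print(len(y_test_a), sum(y_test_a))  # test size and number of spam labels in it
```

Whether to fix the seed in this notebook is a judgment call; doing so would change the scores shown in the outputs below.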

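Then, instantiate `CountVectorizer` (while getting rid of English stop words). Fit it on the training data only, so that the vocabulary is learned from the training comments, and then use the same fitted vectorizer to transform the test data into vectors.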

In [None]:
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)
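
To see what the vectorizer produces, here is a small sketch on made-up comments (`train_docs` and `test_docs` are hypothetical, not rows from the dataset). It shows the vocabulary learned by `fit_transform` and how `transform` reuses it, ignoring words that were never seen during fitting:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy comments, not taken from the dataset.
train_docs = ["check out my channel", "check out this video", "great song"]
test_docs = ["check out my new song"]

vectorizer = CountVectorizer(stop_words='english')
train_counts = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_counts = vectorizer.transform(test_docs)        # reuses that vocabulary

print(sorted(vectorizer.vocabulary_))  # words kept after stop-word removal
print(train_counts.toarray())          # one row per comment, one column per word
print(test_counts.toarray())           # "new" is not in the vocabulary, so it is dropped
```

Both matrices have the same number of columns, which is what lets a model trained on one make predictions on the other.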

Next, instantiate `MultinomialNB`, and fit the model using the training data vector and the training labels.

In [None]:
nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, y_train)

MultinomialNB()

Make predictions with our model using the test data vector, and check the accuracy score, which is the fraction of test comments that were classified correctly. Running the code below should give us a pretty decent score.

In [None]:
pred = nb_classifier.predict(count_test)
score = metrics.accuracy_score(y_test, pred)
print(score)

0.8857142857142857


Now we can also check the confusion matrix, in which the predictions correspond to the columns, and the ground truth corresponds to the rows. In order from left to right and top to bottom, we are told the number of true negatives, false positives, false negatives, and true positives.

In [None]:
cm = metrics.confusion_matrix(y_test, pred)
print(cm)

[[40  7]
 [ 1 22]]
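
The four cells of the matrix are enough to work out other common metrics by hand. The sketch below copies the numbers from the output above and assumes that `CLASS` 1 means spam (which matches the sample rows shown earlier); the variable names are only illustrative:

```python
# Values copied from the confusion matrix printed above:
# rows are ground truth, columns are predictions.
tn, fp = 40, 7
fn, tp = 1, 22

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the comments flagged as spam, how many really were spam
recall = tp / (tp + fn)      # of the real spam comments, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
```

In this run the model catches almost all of the spam (high recall), while most of its mistakes are legitimate comments flagged as spam (lower precision).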


Finally, how about we try using `TfidfVectorizer` instead and comparing our results? `MultinomialNB` is designed for integer word counts, while tf-idf gives us fractional weights in the form of floats. Fractional values often still work in practice, so it cannot hurt to try.

In [None]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
nb_classifier = MultinomialNB()
nb_classifier.fit(tfidf_train, y_train)
pred = nb_classifier.predict(tfidf_test)
score = metrics.accuracy_score(y_test, pred)
print(score)
cm = metrics.confusion_matrix(y_test, pred)
print(cm)

0.8428571428571429
[[37 10]
 [ 1 22]]
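
The difference between the two vectorizers is easiest to see side by side. This is a minimal sketch on made-up comments (the `docs` list is hypothetical) using both vectorizers with their default settings:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy comments, not taken from the dataset.
docs = ["check out my channel", "check check this video", "great song great video"]

counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)

print(counts.dtype, tfidf.dtype)     # integer counts vs. floating-point weights
print(counts.toarray()[1])           # raw counts for the second comment
print(tfidf.toarray()[1].round(2))   # tf-idf weights for the same comment

# With the default norm='l2', every non-empty tf-idf row has unit length.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```

Both vectorizers learn the same vocabulary here; only the values stored for each word differ.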


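Was that score better? In this run it was not: accuracy dropped from about 0.886 to 0.843, with three more false positives. Either way, we can also try out different values of `alpha`, the smoothing parameter of `MultinomialNB` (the default is 1.0)!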
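
For context, `alpha` controls additive smoothing: each word's probability within a class is estimated as (count + alpha) / (total count + alpha × vocabulary size), so words never seen in that class still get a small non-zero probability. The sketch below works through that formula on made-up counts; `word_counts` and `smoothed_probs` are hypothetical names for illustration, not part of scikit-learn:

```python
# Hypothetical word counts for one class (say, spam); the numbers are made up.
word_counts = {"check": 3, "channel": 1, "song": 0}

def smoothed_probs(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab_size) for w, c in counts.items()}

for alpha in (0.1, 1.0):
    probs = smoothed_probs(word_counts, alpha)
    print(alpha, {w: round(p, 3) for w, p in probs.items()})
```

A larger `alpha` moves probability from frequent words toward rare or unseen ones, which is why the scores change as it varies.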

In [None]:
import numpy as np

# Candidate smoothing values: 0.1, 0.2, ..., 0.9 (np.arange excludes the stop value).
alphas = np.arange(.1, 1, .1)

def train_and_predict(alpha):
    """Fit MultinomialNB on the tf-idf training vectors and return the test accuracy."""
    nb_classifier = MultinomialNB(alpha=alpha)
    nb_classifier.fit(tfidf_train, y_train)
    pred = nb_classifier.predict(tfidf_test)
    score = metrics.accuracy_score(y_test, pred)
    return score

for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

Alpha:  0.1
Score:  0.8571428571428571

Alpha:  0.2
Score:  0.8714285714285714

Alpha:  0.30000000000000004
Score:  0.8714285714285714

Alpha:  0.4
Score:  0.8714285714285714

Alpha:  0.5
Score:  0.8571428571428571

Alpha:  0.6
Score:  0.8428571428571429

Alpha:  0.7000000000000001
Score:  0.8428571428571429

Alpha:  0.8
Score:  0.8428571428571429

Alpha:  0.9
Score:  0.8428571428571429
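
One caveat: picking `alpha` by its test accuracy, as in the loop above, can make the chosen value look better than it would on new data. A common alternative is to choose it by cross-validation on the training data only. The sketch below is one way to do that with `GridSearchCV` and a pipeline. The `texts` and `labels` lists are hypothetical stand-ins for `X_train` and `y_train`, and the grid of three values is an arbitrary choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for X_train / y_train (1 = spam, 0 = not spam).
texts = [
    "check out my channel", "subscribe to my channel please",
    "visit my website for free gifts", "free followers click this link",
    "check my new video and subscribe", "click here to win money",
    "this song is amazing", "love this video so much",
    "best music video ever", "who is watching in 2015",
    "this beat never gets old", "great dance and great song",
]
labels = [1] * 6 + [0] * 6

# The vectorizer and classifier are refit inside every fold,
# so each validation fold is scored by a model that never saw its text.
pipeline = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
param_grid = {'multinomialnb__alpha': [0.1, 0.5, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(texts, labels)

print(search.best_params_)  # alpha with the best mean cross-validated accuracy
print(search.best_score_)
```

With the real data, `search.fit(X_train, y_train)` would take the place of the toy lists, and the test set would be used only once at the end to report the final score.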



That is all for now! Thanks for checking out my work!