
In [36]:
import pandas as pd
import nltk

In [37]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

**Read Dataset**

In [38]:
messages = pd.read_csv('sample_data/SMSSpamCollection', sep = '\t', names = ["label", "message"])
messages.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


**Clean the dataset**

In [39]:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [40]:
# ps (the stemmer) is not used below; the cleaning loop lemmatizes with wordnet instead
ps = PorterStemmer()
wordnet = WordNetLemmatizer()

In [41]:
corpus = []

In [42]:
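# Build the stop-word set once; rebuilding it for every word is very slow
stop_words = set(stopwords.words('english'))

for i in range(len(messages)):
    # Keep letters only, lowercase, and split into tokens
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()
    # Drop stop words and lemmatize what remains
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)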
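
To make the effect of the cleaning loop concrete, here is a minimal sketch that applies the same regex, lowercasing, and stop-word filtering to a single message. The helper name `clean_message` and the tiny `STOP_WORDS` set are illustrative assumptions (the notebook uses NLTK's full English list), and lemmatization is left out so the snippet runs without downloading NLTK data.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses stopwords.words('english') from NLTK.
STOP_WORDS = {"in", "a", "to", "the", "is", "u"}

def clean_message(text, stop_words=STOP_WORDS):
    """Keep letters only, lowercase, split, and drop stop words."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    tokens = letters_only.lower().split()
    return ' '.join(tok for tok in tokens if tok not in stop_words)

print(clean_message("Free entry in 2 a wkly comp to win FA Cup"))
# free entry wkly comp win fa cup
```

Digits and punctuation become spaces, so "2" disappears entirely and "FA" survives only as the lowercase token "fa".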

**Bag Of Words**

In [43]:
from sklearn.feature_extraction.text import CountVectorizer
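# max_features is a hyperparameter: keep only the 2500 most frequent terms
cv = CountVectorizer(max_features=2500)
X = cv.fit_transform(corpus).toarray()

# One-hot encode the labels and keep the second column ("spam"), so spam = 1 and ham = 0
y = pd.get_dummies(messages['label'])
y = y.iloc[:, 1].values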


**Tf-Idf**

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

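# Use a separate name so the CountVectorizer in cv is not overwritten
tfidf = TfidfVectorizer(max_features=2500)
X1 = tfidf.fit_transform(corpus).toarray()

# Same label encoding as above: spam = 1, ham = 0
# Note: X1 and y1 are not used below; the models that follow are trained on the Bag-of-Words features X
y1 = pd.get_dummies(messages['label'])
y1 = y1.iloc[:, 1].values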
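
To show how Tf-Idf weights differ from raw counts, here is a small sketch on the same kind of toy input. The corpus and the `toy_` names are assumptions for this example; it relies on the `TfidfVectorizer` defaults (smoothed IDF and L2 row normalization).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-message corpus, not taken from the dataset
toy_corpus = ["free prize now", "call me now"]

toy_tfidf = TfidfVectorizer()
toy_X1 = toy_tfidf.fit_transform(toy_corpus).toarray()

col = toy_tfidf.vocabulary_   # maps term -> column index
first = toy_X1[0]             # weights for "free prize now"

# "now" occurs in both messages, so it is weighted below "free"
print(first[col["free"]] > first[col["now"]])

# With the default norm='l2', every row has unit length
norms = np.linalg.norm(toy_X1, axis=1)
print(norms)
```

Terms that appear in many messages are down-weighted, which is the main difference from the plain counts produced by `CountVectorizer`.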

**Confusion matrix and accuracy**

In [45]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

def calculate(y_test, y_pred):
    """Print the confusion matrix and the accuracy for a set of predictions."""
    confusion_m = confusion_matrix(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    print(confusion_m)
    print(accuracy)

**Split data into training and testing**

In [46]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

**Naive Bayes**

In [47]:
from sklearn.naive_bayes import MultinomialNB

spam_detect_model = MultinomialNB().fit(X_train, y_train)
y_pred = spam_detect_model.predict(X_test)
calculate(y_test,y_pred)

[[946   9]
 [ 10 150]]
0.9829596412556054
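
One thing to be aware of: the vectorizer in this notebook is fitted on the whole corpus before the train/test split, so its vocabulary is learned from the test messages as well. A common alternative, sketched here on made-up data, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so the vocabulary comes from the training texts only. The texts, labels, and variable names are assumptions for illustration, not part of the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = spam, 0 = ham
train_texts = [
    "win free prize now",
    "free cash win",
    "see you at lunch",
    "call me after lunch",
]
train_labels = [1, 1, 0, 0]

# Hypothetical unseen messages
test_texts = ["free prize", "lunch later"]

# fit() learns the vocabulary from train_texts only;
# predict() reuses that vocabulary on test_texts
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
pred = model.predict(test_texts)
print(pred)   # [1 0]
```

The word "later" never appears in the training texts, so the pipeline simply ignores it at prediction time.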


**Logistic Regression**

In [48]:
from sklearn.linear_model import LogisticRegression

logisticR = LogisticRegression()
logisticR.fit(X_train, y_train)
y_pred = logisticR.predict(X_test)
calculate(y_test,y_pred)

[[955   0]
 [ 15 145]]
0.9865470852017937


**Random Forest Classifier**

In [49]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
calculate(y_test,y_pred)

[[955   0]
 [ 16 144]]
0.9856502242152466
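
Accuracy alone hides how the three models differ. The sketch below recomputes accuracy, precision, and recall for spam from the confusion matrices printed above; the `summarize` helper and the `results` dictionary are names introduced here for illustration. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, with 0 = ham and 1 = spam.

```python
# Confusion matrices copied from the outputs above.
# Rows are true labels, columns are predicted labels; 0 = ham, 1 = spam.
results = {
    "Naive Bayes": [[946, 9], [10, 150]],
    "Logistic Regression": [[955, 0], [15, 145]],
    "Random Forest": [[955, 0], [16, 144]],
}

def summarize(matrix):
    """Return accuracy, precision, and recall for the spam class."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),   # flagged messages that really are spam
        "recall": tp / (tp + fn),      # spam messages that were caught
    }

for name, matrix in results.items():
    s = summarize(matrix)
    print(f"{name}: accuracy={s['accuracy']:.4f}, "
          f"precision={s['precision']:.4f}, recall={s['recall']:.4f}")
```

On this test split, Naive Bayes catches the most spam (150 of 160), while Logistic Regression and Random Forest never flag a ham message as spam but miss 15 and 16 spam messages respectively.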
