# Spam Message Classification

### preprocessing the data

In [4]:
import pandas as pd

In [5]:
df=pd.read_table("https://raw.githubusercontent.com/arib168/data/main/spam.tsv")
df

Unnamed: 0,label,message,length,punct
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2
...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,160,8
5568,ham,Will ü b going to esplanade fr home?,36,1
5569,ham,"Pity, * was in mood for that. So...any other s...",57,7
5570,ham,The guy did some bitching but I acted like i'd...,125,1


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   object
 1   message  5572 non-null   object
 2   length   5572 non-null   int64 
 3   punct    5572 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 174.2+ KB


In [7]:
df['label'].value_counts() # counts the total no. of times each value is repeated inside a column

ham     4825
spam     747
Name: label, dtype: int64

In [8]:
df['message'][5] # prints the message from 5th row

"FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"

In [9]:
df['label'][5] # prints the label from 5th row

'spam'

### I/O

In [10]:
x = df['message'].values
y = df['label'].values

### train-test-split

In [11]:
from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=0)

In [12]:
x_train.size

4179

In [13]:
y_train.size

4179

In [14]:
x_test.size

1393

In [15]:
y_test.size

1393

### Applying the count vectorizer

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
from sklearn.feature_extraction.text import CountVectorizer 
count_vect = CountVectorizer(stop_words='english')
x_train_vect = count_vect.fit_transform(x_train)
x_test_vect = count_vect.transform(x_test)

In [18]:
x_train_vect.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### Support Vector Machine

In [19]:
from sklearn.svm import SVC

In [20]:
model1=SVC()

In [21]:
model1.fit(x_train_vect,y_train)

SVC()

In [22]:
y_pred=model1.predict(x_test_vect)
y_pred

array(['ham', 'spam', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

### Accuracy

In [23]:
from sklearn.metrics import accuracy_score

In [24]:
accuracy_score(y_test,y_pred)

0.9813352476669059

- If we want to chech the output for any desired input message we cannot do this:  model.predict("Free coupons") since the model only accepts the numerical data. So, first we need to convert the text input into array using "count vectorizer"

- In order to overcome the above problem we use a concept called "PIPELINE". It combines both CountVectorizer() and SVC()

### SVM with Pipeline

In [25]:
from sklearn.pipeline import make_pipeline

In [26]:
model2=make_pipeline(CountVectorizer(), SVC()) # loading the make_pipeline into 'model2' variable 

### Training using make_pipeline

In [27]:
model2.fit(x_train,y_train) # here i did not us x_train_vect  

Pipeline(steps=[('countvectorizer', CountVectorizer()), ('svc', SVC())])

In [28]:
y_pred2 = model2.predict(x_test)
y_pred2

array(['ham', 'spam', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

In [29]:
accuracy_score(y_pred2, y_test)

0.9834888729361091

- The accuracy of the SVM model has increased after using "pipeline"

### Naive Bayes

solving the above problem using Naive Bayes

It uses Bayes Theorem

In [30]:
from sklearn.naive_bayes import MultinomialNB

In [31]:
model3=MultinomialNB()

In [32]:
model3.fit(x_train_vect,y_train)

MultinomialNB()

In [33]:
y_pred3=model3.predict(x_test_vect)
y_pred

array(['ham', 'spam', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

### Accuracy

In [34]:
accuracy_score(y_pred3,y_test)

0.9863603732950467

### Naive Bayes with Pipeline

In [35]:
model4=make_pipeline(CountVectorizer(),MultinomialNB())

In [36]:
model4.fit(x_train,y_train)

Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('multinomialnb', MultinomialNB())])

In [37]:
y_pred4= model4.predict(x_test)
y_pred4

array(['ham', 'spam', 'ham', ..., 'spam', 'ham', 'ham'], dtype='<U4')

### Accuracy

In [38]:
accuracy_score(y_pred4,y_test)

0.9885139985642498

- The accuracy of the Naive Bayes model has increased after using "pipeline"

### Naive Bayes with Pipeline has the highest accuracy of 9.885

### Saving the model with highest accuracy using "joblib"

In [45]:
import joblib
joblib.dump(model4, "spam filtering")  # saving the model in a file "spam filtering"


['spam filtering']

### loading the saved model

In [43]:
import joblib
saved_model=joblib.load("spam filtering")
saved_model


Pipeline(steps=[('countvectorizer', CountVectorizer()),
                ('multinomialnb', MultinomialNB())])

In [44]:
saved_model.predict(["Free entry to the concert"])

array(['spam'], dtype='<U4')