<a href="https://colab.research.google.com/github/tilakrvarma22/qrscan/blob/main/SpamSMSClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam SMS classification using NLP
* This is a supervised learning task: binary classification.
* The output must be either spam or not spam (ham).
* The data is text, so we apply NLP to convert each message into a numerical representation that a machine learning model can work with and that yields the desired output.

## 1. Data Extraction
* The data file was downloaded from https://fahadhussaincs.blogspot.com/p/nlp-deep-nlp_9.html, from which the tutorial-18_20 file (pdf) has to be downloaded.
* Google Colab saves the notebook itself, but data files are not part of the notebook; they have to be uploaded to the runtime or read from Google Drive.
* To take a file from the local computer, we use the `files` module, whose `upload` function uploads files directly from the computer.

In [None]:
from google.colab import files

In [None]:
uploaded=files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(name=fn,length=len(uploaded[fn])))

Saving smsspamcollection.tsv to smsspamcollection.tsv
User uploaded file "smsspamcollection.tsv" with length 513887 bytes


**Note:** `.tsv` (tab-separated values) is a plain-text format that stores tabular data, one row per line, with the columns separated by tab characters.
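To make the format concrete, here is a minimal sketch that parses a two-line TSV string with Python's standard `csv` module. The sample string is an assumption built from the first row of this dataset; it is not read from the real file. Because the separator is a tab, the comma inside the message needs no quoting.

```python
import csv
import io

# Hypothetical in-memory sample: a header line plus the first row of the dataset.
sample = (
    "label\tmessage\tlength\tpunct\n"
    "ham\tGo until jurong point, crazy.. Available only in bugis n great "
    "world la e buffet... Cine there got amore wat...\t111\t9\n"
)

# delimiter="\t" tells the reader to split columns on tab characters.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

for row in rows:
    print(row["label"], row["length"], row["punct"])
```

`pd.read_csv(..., sep="\t")` does the same splitting and additionally infers numeric column types.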

In [None]:
import numpy as np
import pandas as pd

In [None]:
train=pd.read_csv("smsspamcollection.tsv",sep="\t")
train.head()

| | label | message | length | punct |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


In [None]:
# index=False avoids writing the row index as an extra unnamed column
train.to_csv("spam_sms_classification.csv",index=False)

In [None]:
train["message"][2]

"Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

In [None]:
len(train)

5572

In [None]:
train.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [None]:
train.message.unique()

array(['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
       'Ok lar... Joking wif u oni...',
       "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
       ..., 'Pity, * was in mood for that. So...any other suggestions?',
       "The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free",
       'Rofl. Its true to its name'], dtype=object)

In [None]:
train.message.value_counts()

Sorry, I'll call later                                                                                                                                         30
I cant pick the phone right now. Pls send a message                                                                                                            12
Ok...                                                                                                                                                          10
Okie                                                                                                                                                            4
Your opinion about me? 1. Over 2. Jada 3. Kusruthi 4. Lovable 5. Silent 6. Spl character 7. Not matured 8. Stylish 9. Simple Pls reply..                        4
                                                                                                                                                               ..
No. On the way home. So if n

In [None]:
train['label']

0        ham
1        ham
2       spam
3        ham
4        ham
        ... 
5567    spam
5568     ham
5569     ham
5570     ham
5571     ham
Name: label, Length: 5572, dtype: object

In [None]:
# Create the feature set (message length and punctuation count) and the label set
X=train[['length','punct']]
y=train['label']

In [None]:
X

| | length | punct |
|---|---|---|
| 0 | 111 | 9 |
| 1 | 29 | 6 |
| 2 | 155 | 6 |
| 3 | 49 | 6 |
| 4 | 61 | 2 |
| ... | ... | ... |
| 5567 | 160 | 8 |
| 5568 | 36 | 1 |
| 5569 | 57 | 7 |
| 5570 | 125 | 1 |
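The `length` and `punct` columns ship with the dataset, and the notebook does not show how they were produced. The sketch below is an assumption: it counts characters and the characters found in `string.punctuation`, which reproduces the values shown for the first two messages. The helper name `length_and_punct` is hypothetical.

```python
import string

def length_and_punct(message):
    """Return (number of characters, number of punctuation characters)."""
    length = len(message)
    # Assumption: "punctuation" means any character in string.punctuation.
    punct = sum(1 for ch in message if ch in string.punctuation)
    return length, punct

first = ("Go until jurong point, crazy.. Available only in bugis n great "
         "world la e buffet... Cine there got amore wat...")
second = "Ok lar... Joking wif u oni..."

print(length_and_punct(first))
print(length_and_punct(second))
```

Both features ignore what the message says, which is worth keeping in mind when reading the scores of the model trained on them.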


In [None]:
train['message'][0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=1)

print("Training Data Shape:",X_train.shape)
print("Testing Data Shape:",X_test.shape)

Training Data Shape: (4457, 2)
Testing Data Shape: (1115, 2)


In [None]:
from sklearn.svm import SVC
# Support vector classifier trained on the two numeric features
svc_model=SVC(gamma='auto')
svc_model.fit(X_train,y_train)


#### Before TF-IDF

In [None]:

predictions=svc_model.predict(X_test)

In [None]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[925  43]
 [ 78  69]]


In [None]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.92      0.96      0.94       968
        spam       0.62      0.47      0.53       147

    accuracy                           0.89      1115
   macro avg       0.77      0.71      0.74      1115
weighted avg       0.88      0.89      0.89      1115



In [None]:
print("Accuracy of SVC:",metrics.accuracy_score(y_test,predictions))

Accuracy of SVC: 0.8914798206278027
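The numbers in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes the spam-class precision, recall, F1 score and the overall accuracy from the four counts printed above. It assumes scikit-learn's layout, where rows are actual labels and columns are predicted labels, in the order ham, spam.

```python
# Counts copied from the confusion matrix above.
tn, fp = 925, 43   # actual ham:  predicted ham, predicted spam
fn, tp = 78, 69    # actual spam: predicted ham, predicted spam

# Of all messages flagged as spam, how many really were spam?
precision_spam = tp / (tp + fp)
# Of all real spam messages, how many were caught?
recall_spam = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
# Share of all test messages classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision_spam:.2f} recall={recall_spam:.2f} "
      f"f1={f1_spam:.2f} accuracy={accuracy:.4f}")
```

The accuracy looks high mainly because 968 of the 1115 test messages are ham; the spam recall shows that more than half of the spam messages are missed with these two features.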


In [None]:
## 4. Boost accuracy by applying NLP feature extraction to the message text
X=train["message"]
y=train["label"]

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1)

In [None]:
X_train.shape

(3900,)

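### What is **TfidfTransformer**?
* It is a tool used in NLP and information retrieval to convert a collection of documents into numerical feature vectors: it takes the raw term counts produced by `CountVectorizer` and rescales them into TF-IDF weights.
* The weights represent how important a word is within a document relative to the whole collection: a word that is frequent in one message but rare across the others gets a high weight.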
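To show what the weighting does, here is a minimal pure-Python sketch of TF-IDF. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector. The function name `tfidf`, the whitespace tokenisation and the three toy messages are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        # Scale the vector to unit length; "or 1.0" guards empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# Hypothetical toy corpus.
docs = ["free entry win cup", "ok call later", "free call now"]
weights = tfidf(docs)
print(weights[0])
```

In the first message, "free" also occurs in the third message, so it ends up with a lower weight than "entry", which occurs nowhere else.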

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
vector=CountVectorizer()

X_train_counts=vector.fit_transform(X_train)

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer=TfidfTransformer()
X_train_tfidf=tfidf_transformer.fit_transform(X_train_counts)

In [None]:
X_train_tfidf.shape

(3900, 7155)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
text_clf=Pipeline(steps=[("tfidf",TfidfVectorizer()),
                         ("clf",SVC())])
text_clf.fit(X_train,y_train)

In [None]:
predictions=text_clf.predict(X_test)

In [None]:
print(metrics.confusion_matrix(y_test,predictions))

[[1439    3]
 [  30  200]]


In [None]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1442
        spam       0.99      0.87      0.92       230

    accuracy                           0.98      1672
   macro avg       0.98      0.93      0.96      1672
weighted avg       0.98      0.98      0.98      1672



In [None]:
print("Accuracy of SVC using Tf-Idf:",metrics.accuracy_score(y_test,predictions))

Accuracy of SVC using Tf-Idf: 0.9802631578947368


### Bag of Words

In [None]:
train.message[0]

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [None]:
text=train.message[0]
text

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [None]:
text=text.lower()

In [None]:
text

'go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...'

In [None]:
text.split()

['go',
 'until',
 'jurong',
 'point,',
 'crazy..',
 'available',
 'only',
 'in',
 'bugis',
 'n',
 'great',
 'world',
 'la',
 'e',
 'buffet...',
 'cine',
 'there',
 'got',
 'amore',
 'wat...']

In [None]:
from collections import Counter

In [None]:
print("Bag of Words",Counter(text.split()))

Bag of Words Counter({'go': 1, 'until': 1, 'jurong': 1, 'point,': 1, 'crazy..': 1, 'available': 1, 'only': 1, 'in': 1, 'bugis': 1, 'n': 1, 'great': 1, 'world': 1, 'la': 1, 'e': 1, 'buffet...': 1, 'cine': 1, 'there': 1, 'got': 1, 'amore': 1, 'wat...': 1})
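Splitting on whitespace keeps punctuation attached to the words, so `'point,'` and `'crazy..'` become separate vocabulary entries. The sketch below is one possible refinement: it tokenises with the regular expression `\b\w\w+\b`, which is the pattern `CountVectorizer` uses by default, so punctuation is dropped and single-character tokens such as `n` and `e` are ignored. The variable names are illustrative.

```python
import re
from collections import Counter

text = ("Go until jurong point, crazy.. Available only in bugis n great "
        "world la e buffet... Cine there got amore wat...")

# Lowercase first, then keep only runs of two or more word characters.
tokens = re.findall(r"\b\w\w+\b", text.lower())
bag = Counter(tokens)

print(len(tokens), "tokens")
print(bag)
```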


### N-grams

In [None]:
def generate_ngrams(text, n):
    # Tokenize text by splitting on whitespace
    tokens = text.split()
    # Initialize an empty list to store n-grams
    ngrams = []
    # Generate n-grams: slide a window of n tokens over the token list
    for i in range(len(tokens) - n + 1):
        ngram = " ".join(tokens[i:i+n])
        ngrams.append(ngram)
    return ngrams

def main():
    # Sample text data
    text = "This is a sample sentence for generating n-grams. This is used for generating a n-grams."
    # Set the value of n for n-grams
    n = 2
    # Generate n-grams
    ngrams = generate_ngrams(text, n)
    # Print the n-grams
    print(f"{n}-grams:")
    for ngram in ngrams:
        print(ngram)

if __name__ == "__main__":
    main()


2-grams:
This is
is a
a sample
sample sentence
sentence for
for generating
generating n-grams.
n-grams. This
This is
is used
used for
for generating
generating a
a n-grams.
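N-grams become features once they are counted, in the same way as single words in a bag of words. The following sketch combines the two ideas under a few assumptions: tokens are lowercased, punctuation is dropped with the regular expression `[\w-]+` (so `n-grams` stays one token), and the n-grams are built with `zip` instead of an index loop. The name `count_ngrams` is hypothetical.

```python
import re
from collections import Counter

def count_ngrams(text, n):
    """Count word n-grams after lowercasing and dropping punctuation."""
    # Keep word characters and hyphens so "n-grams" is a single token.
    tokens = re.findall(r"[\w-]+", text.lower())
    # Zipping n shifted views of the token list yields consecutive n-tuples.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)

text = "This is a sample sentence for generating n-grams.This is used for generating a n-grams."
bigram_counts = count_ngrams(text, 2)

for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```

With scikit-learn, a comparable effect is available through the `ngram_range` parameter of `CountVectorizer` and `TfidfVectorizer`, for example `ngram_range=(1, 2)` for unigrams plus bigrams.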
