<a href="https://colab.research.google.com/github/sanika-mhadgut/NLP/blob/master/SMS_Spam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Name : Sanika Mhadgut
#Branch : Btech Data Science Sem 6
#Roll No: J031

Blacklisting is a technique that identifies IP addresses that send large amounts of spam. This method uses the number of recipients to determine if an email is spam or not. However, many legitimate emails can have high traffic volumes. Scanning message headings is a fairly reliable way to detect spam.



Text mining (deriving information from text) is a wide field which has gained popularity with the huge text data being generated. Automation of a number of applications like sentiment analysis, document classification, topic classification, text summarization, machine translation, etc has been done using machine learning models.

In [0]:
#Importing libraries
import pandas as pd
import zipfile
import re

In [0]:
zf = zipfile.ZipFile('smsspamcollection.zip')

In [0]:
zf

<zipfile.ZipFile filename='smsspamcollection.zip' mode='r'>

In [0]:
data = pd.read_csv(zf.open('SMSSpamCollection'), sep='\t',names=["label", "message"])

In [0]:
data['message']

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                 Will ü b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: message, Length: 5572, dtype: object

a) Removal of stop words – Stop words like “and”, “the”, “of”, etc are very common in all English sentences and are not very meaningful in deciding spam or legitimate status, so these words have been removed from the emails.

b) Lemmatization – It is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, “include”, “includes,” and “included” would all be represented as “include”. The context of the sentence is also preserved in lemmatization as opposed to stemming (another buzz word in text mining which does not consider meaning of the sentence).

In [0]:
#Function that preprocesses the data for countvectoizer
def Preprocessing(text):
  #Initializing a dataframe of stopwords
  stop_words = pd.read_csv("https://gist.githubusercontent.com/sebleier/554280/raw/7e0e4a1ce04c2bb7bd41089c9821dbcf6d0c786c/NLTK's%2520list%2520of%2520english%2520stopwords")
  stop_words.values.tolist()

  #Considering only alpha-numeric characters, numeric digits not even considered
  text = re.findall(r'\b[A-Za-z]\w+\b', text)
  
  #Remove stopwords
  for word in text:
    if word in stop_words:
      text.remove(word)
      
  #Remove numerics
  #res = ''.join([i for i in text if not i.isdigit()]) 

  #Remove @ and #
  bad_chars = ['@', '#']
  for i in text:
    if i in bad_chars: 
      text = text.remove(i)
  
  text = list(text)

  #Converting list to string
  text = ' '.join(word for word in text)
  return text

Dictionary can be seen by the command print dictionary. You may find some absurd word counts to be high but don’t worry, it’s just a dictionary and you always have the scope of  improving it later. If you are following this blog with provided data-set, make sure your dictionary has some of the entries given below as most frequent words. Here I have chosen 3000 most frequently used words in the dictionary.

In [0]:
Preprocessing(data['message'][0])

'Go until jurong point crazy Available only in bugis great world la buffet Cine there got amore wat'

In [0]:
data['preprocessed_text'] = data['message'].apply(Preprocessing,1)

In [0]:
data

Unnamed: 0,label,message,preprocessed_text
0,ham,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in wkly comp to win FA Cup final tk...
3,ham,U dun say so early hor... U c already then say...,dun say so early hor already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",Nah don think he goes to usf he lives around h...
...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,This is the time we have tried contact have wo...
5568,ham,Will ü b going to esplanade fr home?,Will going to esplanade fr home
5569,ham,"Pity, * was in mood for that. So...any other s...",Pity was in mood for that So any other suggest...
5570,ham,The guy did some bitching but I acted like i'd...,The guy did some bitching but acted like be in...


In [0]:
corpus = data['preprocessed_text'] 
corpus

0       Go until jurong point crazy Available only in ...
1                                   Ok lar Joking wif oni
2       Free entry in wkly comp to win FA Cup final tk...
3                   dun say so early hor already then say
4       Nah don think he goes to usf he lives around h...
                              ...                        
5567    This is the time we have tried contact have wo...
5568                      Will going to esplanade fr home
5569    Pity was in mood for that So any other suggest...
5570    The guy did some bitching but acted like be in...
5571                            Rofl Its true to its name
Name: preprocessed_text, Length: 5572, dtype: object

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
vectorize= TfidfVectorizer()

In [0]:
#fitting the model and passing our sentences right away:
response= vectorize.fit_transform(corpus)

In [0]:
print(response)

  (0, 7473)	0.18241264829651851
  (0, 231)	0.32647198856800297
  (0, 2764)	0.15305130991688437
  (0, 6843)	0.1555161950550194
  (0, 1210)	0.27580485521143805
  (0, 914)	0.3116528020516887
  (0, 3650)	0.27580485521143805
  (0, 7697)	0.22083291550052703
  (0, 2804)	0.18034330636364296
  (0, 916)	0.27580485521143805
  (0, 3263)	0.1069931616636402
  (0, 4720)	0.15602976712614566
  (0, 465)	0.24419040033995174
  (0, 1487)	0.25283008183768235
  (0, 5107)	0.25535167546045223
  (0, 3523)	0.32647198856800297
  (0, 7233)	0.23001810878216972
  (0, 2720)	0.14787418026870422
  (1, 4716)	0.5466243141314314
  (1, 7599)	0.43162957585464123
  (1, 3491)	0.5236804332035243
  (1, 3686)	0.4083258549263009
  (1, 4687)	0.2718944069420321
  (2, 318)	0.1849800442797567
  (2, 5433)	0.1849800442797567
  :	:
  (5570, 951)	0.2830195559272965
  (5570, 2638)	0.2753741356210262
  (5570, 2054)	0.24409532149526603
  (5570, 6248)	0.20541570643501375
  (5570, 943)	0.13633148775441348
  (5570, 7269)	0.2088816708041
  (557

In [0]:
df['tweetsVect'] = list(vectoriser.fit_transform(df['tweets']).toarray())