# Assignment - SMS Spam Collection

Take any text dataset from the UCI Machine Learning Repository and apply the following techniques:

- Bag of Words
- TF-IDF
- N-grams

## Dataset used

SMS Spam Collection - https://archive.ics.uci.edu/dataset/228/sms+spam+collection (Fetched from UCI Machine Learning repository)

## Importing libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Text preprocessing libraries
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

## Importing dataset using Pandas

In [8]:
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['Output', 'message'])

## Checking the dataframe

In [9]:
df.head()

| | Output | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [10]:
df.tail()

| | Output | message |
|---|---|---|
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will ü b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |
| 5571 | ham | Rofl. Its true to its name |


## Checking if dataset has null values or not

In [11]:
df.isnull().sum()

Output     0
message    0
dtype: int64

## Separating the Output variable

In [12]:
y = df['Output']

In [13]:
y

0        ham
1        ham
2       spam
3        ham
4        ham
        ... 
5567    spam
5568     ham
5569     ham
5570     ham
5571     ham
Name: Output, Length: 5572, dtype: object

## Removing the Output column from the dataframe

In [14]:
df.drop('Output', axis=1, inplace=True)

# Bag of Words model

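Before training the Bag of Words model, preprocess the text data:

- Lowercasing the sentences
- Removing special characters (anything that is not a letter)
- Removing stopwords
- Reducing words to a base form; the code below uses lemmatization, although stemming is a common alternative for text classification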


## Converting the data frame to List

In [15]:
sentences = df.values.tolist() # converting the remaining dataframe to list

## Inspecting the list of sentences

In [16]:
sentences

[['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'],
 ['Ok lar... Joking wif u oni...'],
 ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"],
 ['U dun say so early hor... U c already then say...'],
 ["Nah I don't think he goes to usf, he lives around here though"],
 ["FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"],
 ['Even my brother is not like to speak with me. They treat me like aids patent.'],
 ["As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"],
 ['WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours 

## Reducing the dimension of sentences array from 2D to 1D

In [17]:
sentences = np.concatenate(sentences, axis=0).tolist() # converting the list of lists to a single list
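
Because only the `message` column remains at this point, the same flat list can also be read straight from that column, without building a list of lists first. The sketch below compares both routes on a small stand-in dataframe; `toy_df` and its two rows are illustrative assumptions, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the SMS dataframe after 'Output' was dropped
toy_df = pd.DataFrame({
    'message': [
        'Ok lar... Joking wif u oni...',
        'U dun say so early hor... U c already then say...',
    ]
})

# Route used in the notebook: list of single-element rows, flattened with NumPy
nested = toy_df.values.tolist()
flat_via_numpy = np.concatenate(nested, axis=0).tolist()

# Direct route: select the column and convert it to a plain Python list
flat_direct = toy_df['message'].tolist()

print(flat_direct == flat_via_numpy)
```

Both routes give the same list of strings here; the direct one avoids the intermediate 2D structure.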

In [18]:
sentences

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
 'Ok lar... Joking wif u oni...',
 "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
 'U dun say so early hor... U c already then say...',
 "Nah I don't think he goes to usf, he lives around here though",
 "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv",
 'Even my brother is not like to speak with me. They treat me like aids patent.',
 "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune",
 'WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.',
 'Had you

## Preprocessing the Sentences 

In [19]:
import re

# nltk.download('stopwords'); nltk.download('wordnet') # uncomment on the first run if the corpora are missing

stemmer = PorterStemmer() # creating an instance of the PorterStemmer class (not used below; lemmatization is applied instead)
lemmatizer = WordNetLemmatizer() # creating an instance of the WordNetLemmatizer class
stop_words = set(stopwords.words('english')) # building the stopword set once; the corpus file ID is lowercase 'english'

pre_processed_sentences = []

for sentence in sentences:

    review = sentence.lower() # converting the sentence to lowercase
    review = re.sub(r'[^a-zA-Z]', ' ', review) # replacing every non-letter character with a space
    review = review.split() # splitting the sentence into words
    review = [lemmatizer.lemmatize(word) for word in review if word not in stop_words] # removing stopwords and lemmatizing the remaining words
    review = ' '.join(review) # joining the words back to form a sentence

    pre_processed_sentences.append(review)

In [20]:
pre_processed_sentences

['go jurong point crazy available bugis n great world la e buffet cine got amore wat',
 'ok lar joking wif u oni',
 'free entry wkly comp win fa cup final tkts st may text fa receive entry question std txt rate c apply',
 'u dun say early hor u c already say',
 'nah think go usf life around though',
 'freemsg hey darling week word back like fun still tb ok xxx std chgs send rcv',
 'even brother like speak treat like aid patent',
 'per request melle melle oru minnaminunginte nurungu vettam set callertune caller press copy friend callertune',
 'winner valued network customer selected receivea prize reward claim call claim code kl valid hour',
 'mobile month u r entitled update latest colour mobile camera free call mobile update co free',
 'gonna home soon want talk stuff anymore tonight k cried enough today',
 'six chance win cash pound txt csh send cost p day day tsandcs apply reply hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw'

## Training Bag of Words model

In [21]:
from sklearn.feature_extraction.text import CountVectorizer # importing library from sklearn to convert text to numbers

In [27]:
cv = CountVectorizer(ngram_range=(1,3)) # creating an instance of the CountVectorizer class along with ngrams
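
To make the effect of `ngram_range=(1,3)` concrete, here is a minimal sketch that fits a separate vectorizer on a single pre-processed message from above. The names `toy_sentences` and `toy_cv` are illustrative only. It relies on `CountVectorizer`'s default tokenizer, which keeps only tokens of two or more characters, so the single-letter token `u` is dropped before the n-grams are built.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One pre-processed message, used as a toy corpus (illustrative only)
toy_sentences = ['ok lar joking wif u oni']

# Same setting as above: unigrams, bigrams and trigrams
toy_cv = CountVectorizer(ngram_range=(1, 3))
toy_cv.fit(toy_sentences)

# vocabulary_ maps each n-gram to its column index; sort the keys for display
features = sorted(toy_cv.vocabulary_)
for feature in features:
    print(feature)

print(len(features))
```

With five surviving tokens, the message contributes five unigrams, four bigrams and three trigrams. This is why the vocabulary of the full corpus grows so quickly when higher-order n-grams are included.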

In [23]:
X = cv.fit_transform(pre_processed_sentences) # fitting the pre-processed sentences to the CountVectorizer instance

In [24]:
cv.vocabulary_

{'go': 21167,
 'jurong': 28717,
 'point': 43735,
 'crazy': 11211,
 'available': 3269,
 'bugis': 6335,
 'great': 23017,
 'world': 64870,
 'la': 29830,
 'buffet': 6327,
 'cine': 8894,
 'got': 22482,
 'amore': 1681,
 'wat': 62442,
 'go jurong': 21371,
 'jurong point': 28718,
 'point crazy': 43740,
 'crazy available': 11214,
 'available bugis': 3272,
 'bugis great': 6338,
 'great world': 23156,
 'world la': 64893,
 'la buffet': 29831,
 'buffet cine': 6328,
 'cine got': 8905,
 'got amore': 22487,
 'amore wat': 1682,
 'go jurong point': 21372,
 'jurong point crazy': 28719,
 'point crazy available': 43741,
 'crazy available bugis': 11215,
 'available bugis great': 3273,
 'bugis great world': 6339,
 'great world la': 23157,
 'world la buffet': 64894,
 'la buffet cine': 29832,
 'buffet cine got': 6329,
 'cine got amore': 8906,
 'got amore wat': 22488,
 'ok': 40105,
 'lar': 30002,
 'joking': 28587,
 'wif': 63809,
 'oni': 40872,
 'ok lar': 40235,
 'lar joking': 30024,
 'joking wif': 28594,
 'wif 

In [25]:
pre_processed_sentences[0]

'go jurong point crazy available bugis n great world la e buffet cine got amore wat'

In [26]:
X[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])
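
The dense row printed above looks empty because a single message uses only a handful of the tens of thousands of vocabulary entries. A more readable check is to list only the non-zero columns of the sparse row together with their terms. The sketch below does this on a two-message toy corpus; `toy_sentences`, `toy_cv` and `index_to_term` are hypothetical names, and the vectorizer is limited to unigrams to keep the output short.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two pre-processed messages, used as a toy corpus (illustrative only)
toy_sentences = [
    'ok lar joking wif u oni',
    'u dun say early hor u c already say',
]

toy_cv = CountVectorizer()  # default unigrams, unlike the (1, 3) range used above
toy_X = toy_cv.fit_transform(toy_sentences)

# vocabulary_ maps term -> column index; invert it to look terms up by column
index_to_term = {idx: term for term, idx in toy_cv.vocabulary_.items()}

# A CSR row stores only its non-zero entries: column indices and their counts
row = toy_X[1]
counts = {index_to_term[i]: int(c) for i, c in zip(row.indices, row.data)}
print(counts)
```

The same pattern applies to `X[0]` and `cv.vocabulary_` from the cells above.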

## Training TF-IDF model

### Importing libraries required

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer # importing library from sklearn to convert text to numbers

tfidf = TfidfVectorizer(ngram_range=(1,1))

### Training the model

In [29]:
X = tfidf.fit_transform(pre_processed_sentences)

In [30]:
pre_processed_sentences[0]

'go jurong point crazy available bugis n great world la e buffet cine got amore wat'

In [31]:
X[0].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])
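
As with the count vectors, the truncated dense output hides the few non-zero weights. The sketch below uses a two-message toy corpus to show how the weights arise under `TfidfVectorizer`'s default settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The names `toy_sentences`, `toy_tfidf` and `manual` are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: 'ok' appears in both messages, every other term in only one
toy_sentences = [
    'ok lar joking wif u oni',
    'ok u dun say early hor',
]

toy_tfidf = TfidfVectorizer(ngram_range=(1, 1))
toy_X = toy_tfidf.fit_transform(toy_sentences)

vocab = toy_tfidf.vocabulary_
row = toy_X[0].toarray().ravel()  # dense weights for the first message

# Manual computation for the first message (tokens: ok, lar, joking, wif, oni)
n_docs = 2
idf_ok = np.log((1 + n_docs) / (1 + 2)) + 1    # document frequency 2
idf_rare = np.log((1 + n_docs) / (1 + 1)) + 1  # document frequency 1
raw = np.array([idf_ok] + [idf_rare] * 4)      # every term frequency is 1
manual = raw / np.linalg.norm(raw)             # L2-normalise the row

print(row[vocab['ok']], manual[0])   # weight of the shared term
print(row[vocab['lar']], manual[1])  # weight of a term unique to this message
```

The term shared by both messages receives a lower weight than the terms unique to the first message, which is the down-weighting of common words that TF-IDF adds over raw counts.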