In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split 

## Importing Dataset

In [11]:
dataset = pd.read_csv("spam_ham_dataset.csv")
dataset.head()

Unnamed: 0.1,Unnamed: 0,label,text
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see..."
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar..."
3,4685,spam,"Subject: photoshop , windows , office . cheap ..."
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...


## Data Exploration

In [12]:
## Let's do some data exploration. We can group the data by the values in the "label" column.
## In this case there are two categories: "ham" and "spam".

dataset.groupby('label').describe()

Summary statistics for the "Unnamed: 0" column, grouped by label:
label,count,mean,std,min,25%,50%,75%,max
ham,3672.0,1835.5,1060.159422,0.0,917.75,1835.5,2753.25,3671.0
spam,1499.0,4421.0,432.86834,3672.0,4046.5,4421.0,4795.5,5170.0



An important thing to remember: machine learning models work with numbers, not with text (strings).
Whenever we deal with text-based columns, we need to convert them into numbers.

Let's convert the "label" column into a numeric format (0 or 1).
The apply() function can be used for this.
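
As a small side-by-side sketch, the same conversion can also be written with `map()` and a dictionary. The toy DataFrame below is made up for illustration and is not the real dataset; the column names `spam_apply` and `spam_map` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({"label": ["ham", "spam", "ham", "spam"]})

# Option 1: apply() with a lambda, the approach used in this notebook.
toy["spam_apply"] = toy["label"].apply(lambda x: 1 if x == "spam" else 0)

# Option 2: map() with a dictionary lookup.
# A label that is missing from the dictionary would become NaN.
toy["spam_map"] = toy["label"].map({"ham": 0, "spam": 1})

print(toy)
```

Both columns hold the same values here. The `map()` version makes unexpected labels visible as NaN, while the lambda quietly turns anything that is not 'spam' into 0.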



In [13]:

dataset['spam'] = dataset['label'].apply(lambda x: 1 if x=='spam' else 0)
dataset.head()

# Here the lambda function is applied to every value of the 'label' column:
# it returns 1 if the value is 'spam' and 0 otherwise (so 'ham' becomes 0).

Unnamed: 0.1,Unnamed: 0,label,text,spam
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


 Usually we avoid training the model on the entire dataset, because we would then have no unseen data left to check it on.
 A good strategy is to split the data into 2 parts.

  Part_1: used for training the model.
  
  Part_2: used for testing the model.

In [14]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(dataset.text,dataset.spam, test_size=0.30)


# Here train_test_split is used to split the data into two parts: part_1 for training and part_2 for testing.
# test_size=0.30 means that 30% of the data is held out for testing; the remaining 70% is used to train the model.
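
The split above is random, so it changes on every run. A minimal sketch of a repeatable split follows, assuming toy data in place of `dataset.text` and `dataset.spam`. The seed value 42 is arbitrary, and `stratify` is an optional extra that this notebook does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for dataset.text / dataset.spam (illustrative only).
texts = pd.Series([f"email {i}" for i in range(10)])
labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.30,     # 30% of the rows go to the test set
    random_state=42,    # arbitrary seed, so the split is repeatable
    stratify=labels,    # keep the spam/ham ratio similar in both parts
)

print(len(X_tr), len(X_te))
```

Stratifying can help with a dataset like this one, where ham emails clearly outnumber spam emails.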


In [15]:
dataset.label

0        ham
1        ham
2        ham
3       spam
4        ham
        ... 
5166     ham
5167     ham
5168     ham
5169     ham
5170    spam
Name: label, Length: 5171, dtype: object

In [16]:
# The output shows the text data, which we later need to convert into numbers.

dataset.text

0       Subject: enron methanol ; meter # : 988291\r\n...
1       Subject: hpl nom for january 9 , 2001\r\n( see...
2       Subject: neon retreat\r\nho ho ho , we ' re ar...
3       Subject: photoshop , windows , office . cheap ...
4       Subject: re : indian springs\r\nthis deal is t...
                              ...                        
5166    Subject: put the 10 on the ft\r\nthe transport...
5167    Subject: 3 / 4 / 2000 and following noms\r\nhp...
5168    Subject: calpine daily gas nomination\r\n>\r\n...
5169    Subject: industrial worksheets for august 2000...
5170    Subject: important online banking alert\r\ndea...
Name: text, Length: 5171, dtype: object

In [17]:
# Remember that we needed to convert the text columns into numbers somehow? Let's do that with the 'text' column.

## Training and Testing

 The 'text' column contains a large amount of free text, unlike the 'label' column, which has only two label names, 'spam' and 'ham'.
 We could start by assigning each letter a numeric value, but in the real world that is not a sufficient technique.
 For this column we will use the "Count Vectorizer" technique: we find the unique words and treat each of them as a feature.
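
To see what the technique does on a small scale, here is a sketch with two made-up sentences (not taken from the dataset). Each unique word becomes one column, and each row holds the word counts for one sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, used only to show the mechanics.
docs = ["free money free prize", "meeting about money"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)

# vocabulary_ maps each word to its column index; sort the words by that index.
print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

# One row per sentence, one column per unique word.
print(counts.toarray())
```

The word "free" appears twice in the first sentence, so its column holds 2 in the first row and 0 in the second.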

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
X_train_count = v.fit_transform(X_train.values)
X_train_count.toarray()[:4]

# This code creates a matrix: each row is one email, each column is one unique word in the training set,
# and each value is how many times that word appears in that email.
# sklearn's CountVectorizer converts the email text into numbers (a matrix representation).

array([[3, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

Now that the 'text' column has been converted into word counts, we will classify the emails with "MultinomialNB".
 MultinomialNB (multinomial Naive Bayes) is designed for discrete data (e.g. a movie rating such as a Rotten Tomatoes percentage).

 For text, the features are the counts of each word, which are used to predict the label or class.
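
Here is a small sketch of that idea on made-up messages (not from the dataset), assuming the default settings of `CountVectorizer` and `MultinomialNB`. Words seen mostly in spam pull a new message towards class 1, and words seen mostly in ham pull it towards class 0.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages; 1 = spam, 0 = ham.
train_docs = [
    "win free prize now",
    "free lottery win cash",
    "meeting schedule for monday",
    "please review the meeting notes",
]
train_labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB()  # default smoothing, alpha=1.0
clf.fit(vec.fit_transform(train_docs), train_labels)

# New messages must go through the same fitted vectorizer.
new_docs = ["free cash prize", "monday meeting notes"]
new_counts = vec.transform(new_docs)

pred = clf.predict(new_counts)         # predicted class for each message
proba = clf.predict_proba(new_counts)  # columns: P(ham), P(spam)
print(pred)
print(proba)
```

`predict_proba()` is handy when you want to know how sure the model is, not only which class it picked.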

In [19]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(X_train_count,y_train)

# Here we run model.fit() on X_train_count and y_train.
# X_train_count contains the text that was converted into a numeric matrix.

MultinomialNB()

## Evaluation

In [20]:
# Let's do some evaluation testing of our model.

emails = [
    'subject: Dear Usman, Lets Play football?',
    'subject: 1 billion Nok lottery, for doing nothing, Dont miss this reward!',
]

emails_count = v.transform(emails)
model.predict(emails_count)

# Return = 1 means Spam!
# Return = 0 means Ham!

array([0, 1], dtype=int64)
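
Calling `v.transform()` before every `predict()` is easy to forget. One possible alternative, shown here as a sketch on made-up messages, is scikit-learn's `make_pipeline`, which bundles the vectorizer and the model into one object. The name `spam_clf` is hypothetical and does not appear elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up messages for illustration; 1 = spam, 0 = ham.
docs = [
    "claim your free reward today",
    "lottery reward waiting claim now",
    "lets play football this weekend",
    "football practice moved to friday",
]
labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then classifies it.
spam_clf = make_pipeline(CountVectorizer(), MultinomialNB())
spam_clf.fit(docs, labels)

# Raw strings go straight in; no separate transform step is needed.
result = spam_clf.predict(["free lottery reward", "play football friday"])
print(result)
```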

In [21]:
dataset.head()

Unnamed: 0.1,Unnamed: 0,label,text,spam
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0


## Accuracy Count

In [22]:
#Accuracy Test

X_test_count = v.transform(X_test)
model.score(X_test_count, y_test)

# X_test must first be transformed into word counts, because our model works on numbers, not on strings.
# Then we feed it to the model; score() returns the mean accuracy on the test set.

0.9768041237113402

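## In this example, the accuracy on the test set is about 97.7%, which is above 95%. The exact value changes from run to run because the split is random.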
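
Accuracy is a single number, and this dataset has far more ham (3672) than spam (1499), so it can be worth looking at each class on its own. The sketch below uses hypothetical labels and predictions, NOT results from this notebook, to show `confusion_matrix` and `classification_report` from `sklearn.metrics`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions (0 = ham, 1 = spam).
# These are NOT results from the model trained above.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are the true classes, columns are the predicted classes (ham, spam).
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Precision, recall and F1-score for each class.
print(classification_report(y_true, y_pred, target_names=["ham", "spam"]))
```

With the real data, the same calls would take `y_test` and `model.predict(X_test_count)` as inputs.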