In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn import set_config
set_config(display='diagram')

# Problem 2: Email spam detection

Nearly every email user has at some point received a "spam" email: an unsolicited message that often advertises a product, contains links to malware, or attempts to scam the recipient.
Roughly 80-90% of the more than 100 billion emails sent each day are spam, most of it sent from botnets of malware-infected computers.
The remaining, legitimate emails are called "ham" emails.

In [2]:
# load the data
url = 'https://raw.githubusercontent.com/um-perez-alvaro/Data-Science-Practice/master/Data/emails.csv'
emails = pd.read_csv(url)
emails.head()

Unnamed: 0,text,label
0,"Date: Wed, 21 Aug 2002 10:54:46 -05...",ham
1,"Martin A posted:\nTassos Papadopoulos, the Gre...",ham
2,Man Threatens Explosion In Moscow \n\nThursday...,ham
3,Klez: The Virus That Won't Die\n \nAlready the...,ham
4,"> in adding cream to spaghetti carbonara, whi...",ham


There are 3000 emails in the dataset.

In [3]:
len(emails)

3000

Of these, 2500 are ham and 500 are spam.

In [4]:
emails.label.value_counts()

ham     2500
spam     500
Name: label, dtype: int64
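Because the classes are imbalanced, accuracy has to be read against a baseline. The minimal sketch below uses only the counts printed above (the variable names are illustrative) to compute the accuracy of a trivial classifier that labels every email as ham.

```python
# Class counts copied from the value_counts() output above.
counts = {"ham": 2500, "spam": 500}

# A classifier that always predicts the majority class is right
# exactly as often as that class occurs.
baseline_accuracy = max(counts.values()) / sum(counts.values())

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any useful spam classifier should score clearly above this baseline.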

Let's look at one example of ham and one example of spam to get a feel for what the data looks like.

In [5]:
# ham example
print(emails.loc[9].text)

I have been trying to research via SA mirrors and search engines if a canned
script exists giving clients access to their user_prefs options via a
web-based CGI interface. Numerous ISPs provide this feature to clients, but
so far I can find nothing. Our configuration uses Amavis-Postfix and ClamAV
for virus filtering and Procmail with SpamAssassin for spam filtering. I
would prefer not to have to write a script myself, but will appreciate any
suggestions.



-------------------------------------------------------
This sf.net email is sponsored by: OSDN - Tired of that same old
cell phone?  Get a new here for FREE!
https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390
_______________________________________________
Spamassassin-talk mailing list
Spamassassin-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk




In [6]:
# spam example
print(emails.loc[2990].text)


Get 12 FREE VHS or DVDs!
  Click  HYPERLINK  HERE For Details!
We Only Have HIGH QUALITY Porno Movies to Choose From!
 "This is a VERY SPECIAL, LIMITED TIME OFFER." Get up to 12 DVDs absolutely FREE, with HYPERLINK  NO COMMITMENT!
There's no better deal anywhere.
There's no catches and no gimmicks. You only pay for the shipping, and the DVDs are absolutely free!
Take a Peak at our HYPERLINK   Full Catalog!
 High quality cum filled titles such as:
 HYPERLINK  500 Oral Cumshots 5
Description: 500 Oral Cum Shots! I need hot jiz on my face! Will you cum in my mouth?
 Dozens of Dirty Hardcore titles such as:
 HYPERLINK  Amazing Penetrations No. 17
Description: 4 full hours of amazing penetrations with some of the most beautiful women in porn!
 From our "Sexiest Innocent Blondes" collections:
 HYPERLINK  Audition Tapes
Description: Our girls go from cute, young and innocent, to screaming sex goddess
 beggin' to have massive cocks in their tight, wet pussies and asses!



The **goal** is to build a spam classifier.

**Part 0:** Drop rows with missing values.

In [7]:
emails.dropna(axis=0,inplace=True)

**Part 1:** Define X and y from the DataFrame, and then split X and y into training and testing sets, using the text as the only feature and the label (ham/spam) as the target.

In [8]:
X=emails.text

In [9]:
y=emails.label

In [10]:
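# default split: 75% train, 25% test; no random_state is set, so the split varies between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)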
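With only one spam email for every five ham emails, a purely random split can shift the class ratio between the training and test sets. One possible variant, shown here on toy stand-in data rather than on the notebook's DataFrame, is a stratified split with a fixed seed. The `stratify` and `random_state` arguments are standard `train_test_split` parameters; the seed value and the toy data are arbitrary assumptions.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in data with the same 5:1 ham/spam ratio as the dataset.
texts = [f"message {i}" for i in range(120)]
labels = ["ham"] * 100 + ["spam"] * 20

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.25,     # same fraction as the default
    stratify=labels,    # keep the ham/spam ratio in both subsets
    random_state=0,     # arbitrary seed, used only for repeatability
)

print(len(X_tr), len(X_te))
print(y_tr.count("spam"), y_te.count("spam"))
```

Using this variant in the notebook would change the exact scores reported in the later cells, since those come from an unseeded split.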

**Part 2:** Build a classification pipeline (tf-idf vectorizer + Naive Bayes model).

In [11]:
pipe=Pipeline(steps=[
    ('vect',TfidfVectorizer()),
    ('clf',MultinomialNB())
])
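To see what the two steps do, here is a small sketch of the same pipeline fitted on four made-up messages (the toy texts and the `toy_` names are assumptions, not part of the dataset). The vectorizer turns each text into a row of tf-idf weights, and the Naive Bayes model classifies that row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy messages, not taken from the dataset.
toy_texts = [
    "free dvds click here",
    "limited time offer free",
    "meeting notes attached",
    "please review the script",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_pipe = Pipeline(steps=[
    ("vect", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
toy_pipe.fit(toy_texts, toy_labels)

# Step 1: each text becomes a sparse row with one column per vocabulary token.
tfidf = toy_pipe.named_steps["vect"].transform(["free offer"])
print(tfidf.shape)

# Step 2: predict() runs the vectorizer and then the classifier.
toy_predictions = toy_pipe.predict(["free offer", "script review notes"])
print(toy_predictions)
```

Wrapping both steps in one `Pipeline` means the vectorizer is refit on the training folds only during cross-validation, so no vocabulary or idf statistics leak from the validation folds.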

**Part 3:** Use a grid search to tune the pipeline hyperparameters.

In [12]:
params_dic = {
    'vect__max_features': [1000, 2000],
    'vect__stop_words': ['english', None],
    'vect__min_df': [1, 5, 10],
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__use_idf': [True, False],
}
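The grid search tries every combination of these values, so its cost grows multiplicatively with each added option. As a quick check, the sketch below counts the combinations with scikit-learn's `ParameterGrid`; the `n_folds` value assumes 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above.
params_dic = {
    'vect__max_features': [1000, 2000],
    'vect__stop_words': ['english', None],
    'vect__min_df': [1, 5, 10],
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__use_idf': [True, False],
}

n_candidates = len(ParameterGrid(params_dic))  # 2 * 2 * 3 * 2 * 2
n_folds = 5                                    # assumed number of CV folds
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```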

In [13]:
grid=GridSearchCV(pipe,params_dic, cv=5, scoring='accuracy', n_jobs=-1,verbose=True)

In [14]:
grid.fit(X_train,y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    8.5s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:   39.0s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:   53.2s finished


In [15]:
grid.best_score_

0.9839920811680278

In [16]:
grid.best_params_

{'vect__max_features': 1000,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 2),
 'vect__stop_words': None,
 'vect__use_idf': True}

In [17]:
best_clf=grid.best_estimator_

**Part 4:** Evaluate the performance of your classification pipeline on the test set.

In [18]:
y_test_pred=best_clf.predict(X_test)

In [19]:
accuracy_score(y_test,y_test_pred)

0.9813333333333333

In [20]:
confusion_matrix(y_test,y_test_pred)

array([[626,   1],
       [ 13, 110]], dtype=int64)
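In this matrix, rows are true labels and columns are predicted labels. scikit-learn sorts the labels, so index 0 is ham and index 1 is spam. The sketch below copies the numbers from the output above and derives precision and recall for the spam class, which say more than accuracy alone on imbalanced data. Treating spam as the positive class is a choice made here for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true ham, true spam. Columns: predicted ham, predicted spam.
cm = np.array([[626, 1],
               [13, 110]])

# With spam as the positive class, ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of emails flagged as spam that are spam
recall = tp / (tp + fn)      # share of spam emails that were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On this split the classifier almost never flags a ham email as spam (1 of 627), but it lets 13 of the 123 spam emails through.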