In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Put in textual contents of the spam documents 1-6
corpus = [
    'I wanted to let you know about money that is available for college in your State. The amount is up to $5,730* if you qualify. It takes like 2 minutes to check if you qualify. Click Here to get matched.',
    "Free-Coupons for next movie. The above links will take you straight to our partner's site. For more information or to see other offers available, you can also visit the Groupon on the Working Advantage website.",
    'Our records indicate your Pension is under performing to see higher growth and up to 25% cash release reply PENSION for a free review. To opt out reply STOP',
    "Enter to win $25,000 and get a Free Hotel Night! Just click here for a $1 trial membership in NetMarket, the Internet'spremier discount shopping site: Fast Company EZVenture gives you FREE business articles,PLUS, you could win YOUR CHOICE of a BMW Z3 convertible, $100,000, shares of Microsoft stock, or a home office computer. Go there and get your chances to win now. A crazy-funny-cool trivia book with a $10,000 prize? PLUS chocolate, nail polish, cats, barnyard animals, and more?",
    "Dear recipient, Avangar Technologies announces the beginning of a new unprecendented global employment campaign. Due to company's exploding growth Avangar is expanding business to the European region. During last employment campaign over 1500 people worldwide took part in Avangar's business and more than half of them are currently employed by the company. And now we are offering you one more opportunity to earn extra money working with Avangar Technologies. We are looking for honest, responsible, hard-working people that can dedicate 2-4 hours of their time per day and earn extra £300-500 weekly. All offered positions are currently part-time and give you a chance to work mainly from home.",
    "I know that's an incredible statement, but bear with me while I explain. You have already deleted mail from dozens of 'Get Rich Quick' schemes, chain letter offers, and LOTS of other absurd scams that promise to make you rich overnight with no investment and no work. My offer isn't one of those. What I'm offering is a straightforward computer-based service that you can run full-or part-time like a regular business. This service runs auto-matically while you sleep, vacation, or work a 'regular' job. It provides a valuable new service for businesses in your area. I'm offering a high-tech, low-maintenance, work-from- anywhere business that can bring in a nice comfortable additional income for your family. I did it for eight years. Since I started inviting others to join me, I've helped over 4000 do the same."
]

In [3]:
# Spam dictionary: words and phrases that commonly appear in spam messages
spam_dict = 'Free,Click here,visit,open attachment,call this number,money,Out,extra,offer,available,Pension,Opportunity,Chance,Investment'.split(',')

In [4]:
spam_dict

['Free',
 'Click here',
 'visit',
 'open attachment',
 'call this number',
 'money',
 'Out',
 'extra',
 'offer',
 'available',
 'Pension',
 'Opportunity',
 'Chance',
 'Investment']

In [5]:
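# The spam dictionary contains phrases of up to three words, which would need ngram_range=(1, 3).
# Note: ngram_range is left at its default of (1, 1) here and the text is lower-cased before
# matching, so multi-word entries and capitalized entries such as 'Free' never match.
vectorizer = TfidfVectorizer(vocabulary=spam_dict, analyzer='word', stop_words='english', sublinear_tf=True)

# Compute tf-idf scores for the spam-dictionary terms that appear in the corpus.
X = vectorizer.fit_transform(corpus)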
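
By default `TfidfVectorizer` lower-cases the input text and generates only unigrams, so capitalized or multi-word vocabulary entries cannot match. The following is a minimal sketch of a configuration that lets such entries match. The tiny corpus `docs` and the lower-cased vocabulary `vocab` are made up for illustration and are not part of the notebook. `stop_words` is left out on purpose: "here" and "this" are on scikit-learn's built-in English stop list, and stop words are dropped before n-grams are built, which would prevent phrases like "click here" from forming.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-message corpus, used only to illustrate matching behaviour.
docs = [
    "Click here for a FREE review of your pension",
    "Call this number to claim extra money",
]

# Vocabulary written in lower case so it lines up with the lower-cased tokens.
vocab = ["free", "click here", "call this number", "pension", "money"]

vec = TfidfVectorizer(
    vocabulary=vocab,
    analyzer="word",
    ngram_range=(1, 3),  # build unigrams, bigrams and trigrams
    sublinear_tf=True,
)
X_demo = vec.fit_transform(docs)

# For each document, collect the vocabulary terms with a non-zero score.
matched = [
    {term for term, score in zip(vocab, row) if score > 0}
    for row in X_demo.toarray()
]
print(matched)
```

With this setup the first message matches "free", "click here" and "pension", and the second matches "call this number" and "money".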

In [7]:
# X is a sparse matrix; toarray() converts it to a dense 2-D array with one row per document.
# Build one {term: tf-idf score} dictionary per document.
d = [dict(zip(spam_dict, row)) for row in X.toarray()]
print(d)

[{'Free': 0.0, 'Click here': 0.0, 'visit': 0.0, 'open attachment': 0.0, 'call this number': 0.0, 'money': 0.7071067811865476, 'Out': 0.0, 'extra': 0.0, 'offer': 0.0, 'available': 0.7071067811865476, 'Pension': 0.0, 'Opportunity': 0.0, 'Chance': 0.0, 'Investment': 0.0}, {'Free': 0.0, 'Click here': 0.0, 'visit': 0.7732623667832087, 'open attachment': 0.0, 'call this number': 0.0, 'money': 0.0, 'Out': 0.0, 'extra': 0.0, 'offer': 0.0, 'available': 0.6340862024337309, 'Pension': 0.0, 'Opportunity': 0.0, 'Chance': 0.0, 'Investment': 0.0}, {'Free': 0.0, 'Click here': 0.0, 'visit': 0.0, 'open attachment': 0.0, 'call this number': 0.0, 'money': 0.0, 'Out': 0.0, 'extra': 0.0, 'offer': 0.0, 'available': 0.0, 'Pension': 0.0, 'Opportunity': 0.0, 'Chance': 0.0, 'Investment': 0.0}, {'Free': 0.0, 'Click here': 0.0, 'visit': 0.0, 'open attachment': 0.0, 'call this number': 0.0, 'money': 0.0, 'Out': 0.0, 'extra': 0.0, 'offer': 0.0, 'available': 0.0, 'Pension': 0.0, 'Opportunity': 0.0, 'Chance': 0.0, 'In

In [8]:
df = pd.DataFrame(d)
df.index.name="Document"

In [9]:
df

Document,Chance,Click here,Free,Investment,Opportunity,Out,Pension,available,call this number,extra,money,offer,open attachment,visit
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.634086,0.0,0.0,0.0,0.0,0.0,0.773262
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.900003,0.435884,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
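
The non-zero scores can be checked by hand. The sketch below recomputes the row for document 1 with scikit-learn's default formulas: sublinear term frequency `1 + ln(tf)`, smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The document frequencies (`visit` in one document, `available` in two) are read off the corpus and are an assumption of this sketch.

```python
import math

N_DOCS = 6  # number of documents in the corpus


def idf(doc_freq):
    # Smoothed idf, matching scikit-learn's default smooth_idf=True.
    return math.log((1 + N_DOCS) / (1 + doc_freq)) + 1


def sublinear_tf(count):
    # Sublinear term-frequency scaling (sublinear_tf=True).
    return 1 + math.log(count)


# Document 1: "visit" occurs once (found in 1 document),
# "available" occurs once (found in 2 documents).
raw = {
    "visit": sublinear_tf(1) * idf(1),
    "available": sublinear_tf(1) * idf(2),
}

# L2-normalize the row so its squared scores sum to 1.
norm = math.sqrt(sum(v * v for v in raw.values()))
scores = {term: value / norm for term, value in raw.items()}
print(scores)  # visit is about 0.7733, available about 0.6341
```

These values agree with the `visit` and `available` entries in row 1 of the table.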


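### Most of the tf-idf scores above are zero. The corpus is small, but the vectorizer setup also plays a part: TfidfVectorizer lower-cases the text while dictionary entries such as 'Free' and 'Pension' are capitalized, and multi-word entries such as 'Click here' need an n-gram range wider than the default, so these terms never match even where they occur in a message. For the terms that do have non-zero scores, we can set a threshold, say 0.4, and flag a message as spam when a score exceeds it. More data is needed to choose the threshold reliably and to train a model.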
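
A minimal sketch of the threshold idea, assuming the rule "flag a message when its highest term score exceeds the threshold". That rule is one possible reading of the text, not something the notebook defines. The scores are copied from the table above, and `is_spam` is a hypothetical helper.

```python
THRESHOLD = 0.4  # example cut-off suggested in the text

# Non-zero tf-idf scores per document, copied from the table;
# every entry not listed here is 0.0.
doc_scores = {
    0: {"available": 0.707107, "money": 0.707107},
    1: {"available": 0.634086, "visit": 0.773262},
    2: {},
    3: {},
    4: {"extra": 0.900003, "money": 0.435884},
    5: {"offer": 1.0},
}


def is_spam(scores, threshold=THRESHOLD):
    # Flag the message if its highest term score exceeds the threshold.
    return max(scores.values(), default=0.0) > threshold


predictions = {doc: is_spam(scores) for doc, scores in doc_scores.items()}
print(predictions)
# {0: True, 1: True, 2: False, 3: False, 4: True, 5: True}
```

Documents 2 and 3 are not flagged because none of their terms matched the dictionary. Because each row is L2-normalized, a document that matches exactly one dictionary term always scores 1.0 for it, so a fixed cut-off on these scores mostly measures whether any term matched at all.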