# Spam Classifier

This is a very basic spam classifier. You can export some spam and legitimate ("ham") emails from your Gmail account and use them as training data. The classifier is based on Naive Bayes and uses a count vectorizer to extract word-count features from the spam and ham emails.

In [1]:
import io
import os

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


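data_points = []
for root, dirs, files in os.walk(r'\emails'):
    for file in files:
        # Read the email and drop the header block (everything before the first blank line)
        with io.open(os.path.join(root, file), encoding='latin1') as f:
            body = '\n'.join(f.read().split('\n\n')[1:])
        # Label by directory: any path containing 'spam' is spam, everything else is ham
        data_points.append({'message': body, 'class': 'spam' if 'spam' in root else 'ham'})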

data = pd.DataFrame(data_points)
data.head()

| | class | message |
|---|---|---|
| 0 | ham | Date: Wed, 21 Aug 2002 10:54:46 -05... |
| 1 | ham | Martin A posted:\nTassos Papadopoulos, the Gre... |
| 2 | ham | Man Threatens Explosion In Moscow \nThursday A... |
| 3 | ham | Klez: The Virus That Won't Die\n \nAlready the... |
| 4 | ham | > in adding cream to spaghetti carbonara, whi... |
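The loading loop above drops each email's headers by splitting on blank lines and discarding the first chunk. Here is a minimal sketch of that step in isolation. The `raw` message and the `strip_headers` helper are made up for illustration and are not part of the notebook.

```python
# Hypothetical raw email, invented for illustration (not from the dataset).
raw = (
    "From: alice@example.com\n"
    "Subject: Lunch?\n"
    "\n"
    "Are you free at noon?\n"
    "\n"
    "Let me know."
)


def strip_headers(text):
    """Drop everything before the first blank line, as the loading loop does."""
    return '\n'.join(text.split('\n\n')[1:])


body = strip_headers(raw)
print(repr(body))
```

Two side effects are worth knowing about: paragraph breaks in the body collapse to single newlines, and a file with no blank line at all yields an empty message.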


Now we will use a CountVectorizer to split each message into words and count how often each word occurs, then feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go!

In [2]:
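# Tokenize each message and build a sparse matrix of per-message word counts
V = CountVectorizer(encoding='latin1')
counts = V.fit_transform(data.message.values)
# Fit a multinomial Naive Bayes model on the counts and their spam/ham labels
clf = MultinomialNB()
clf.fit(counts, data['class'].values)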

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
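To see what the CountVectorizer step produces, here is a small sketch on a toy corpus. The three documents and the names `docs`, `vec`, `X` and `vocab` are made up for illustration. The real notebook builds the same kind of matrix from `data.message.values`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = ["buy cheap pills now", "cheap cheap offer", "lunch at noon"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # sparse matrix: one row per document, one column per word

vocab = vec.vocabulary_      # dict mapping each word to its column index
dense = X.toarray()

print(X.shape)                    # (number of documents, number of distinct words)
print(sorted(vocab))              # the learned vocabulary
print(dense[1, vocab['cheap']])   # how many times 'cheap' occurs in the second document
```

Each row is the bag-of-words representation of one message. Word order is discarded and only the counts are kept, which is what MultinomialNB expects as input.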

Let's try it out:

In [3]:
predictions = clf.predict(V.transform([
    'wow! Viagra specially for you. Save thousands of dollars. Buy from us now',
    'Hi SJ, are you up for a volleyball game this afternoon?',
]))
predictions

array(['spam', 'ham'], 
      dtype='|S4')
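Beyond hard labels, MultinomialNB can also report how confident it is through `predict_proba`. The sketch below is self-contained and trains on a tiny made-up corpus (`train_texts`, `train_labels` and `toy_clf` are illustrative names, not part of the notebook), so the probabilities say nothing about the real email data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set, for illustration only.
train_texts = [
    "buy cheap pills now",
    "save thousands of dollars today",
    "are you up for volleyball this afternoon",
    "lunch at noon tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
toy_clf = MultinomialNB()
toy_clf.fit(vec.fit_transform(train_texts), train_labels)

# predict_proba returns one row per message, with columns ordered as in classes_
probs = toy_clf.predict_proba(vec.transform(["save dollars now"]))
for label, p in zip(toy_clf.classes_, probs[0]):
    print(label, round(p, 3))
```

With the notebook's own objects, the equivalent call would be `clf.predict_proba(V.transform([...]))`, with columns ordered according to `clf.classes_`.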