In [1]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import pandas as pd
from columns import columns

# Spamalot

This project explores simple Bayesian classification: separating training and test data, creating a naive Bayes classifier, training it, and testing it. The data set labels each email as spam or not spam and records how frequently certain strings appear in it; the classifier uses those frequencies to predict the label.

##### Reading the data in. The column headers are kept in a separate Python file to keep this notebook clean.

In [2]:
df = pd.read_csv('spambase.data', names=columns)

##### The X values are the columns of the DataFrame that are used to determine whether an email is considered spam. The y values are the column that indicates whether the email is considered spam. The data is then split into training and testing data.

In [3]:
X = df[columns[:-1]]
y = df["SPAM"]

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.6, random_state=42)

##### Creating a Bayesian classifier, training it on the training data, and scoring it on both the training and testing data.

In [5]:
bay = MultinomialNB()

In [6]:
bay.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
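
To make the `alpha=1.0` and `fit_prior=True` settings shown above more concrete, here is a minimal pure-Python sketch of what a multinomial naive Bayes classifier estimates: a prior for each class taken from the class frequencies, and a per-feature probability smoothed by `alpha`. This is not scikit-learn's implementation; the function names `fit_multinomial_nb` and `predict_one` and the toy data are made up for illustration.

```python
import math
from collections import defaultdict


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities per class."""
    n_features = len(X[0])
    class_counts = defaultdict(int)
    feature_totals = defaultdict(lambda: [0.0] * n_features)
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, value in enumerate(row):
            feature_totals[label][j] += value

    model = {}
    for label, count in class_counts.items():
        totals = feature_totals[label]
        # Additive (Laplace) smoothing: add alpha to every feature total.
        denom = sum(totals) + alpha * n_features
        log_prior = math.log(count / len(y))
        log_likelihood = [math.log((t + alpha) / denom) for t in totals]
        model[label] = (log_prior, log_likelihood)
    return model


def predict_one(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(v * ll for v, ll in zip(row, log_likelihood))
    return max(model, key=log_posterior)


# Toy data (made up): feature 0 is frequent in class 1, feature 1 in class 0.
X_toy = [[3, 0], [4, 1], [0, 2], [1, 3], [0, 4]]
y_toy = [1, 1, 0, 0, 0]
toy_model = fit_multinomial_nb(X_toy, y_toy)
print(predict_one(toy_model, [5, 0]), predict_one(toy_model, [0, 5]))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a feature that never appears in one class from forcing that class's probability to zero.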

In [7]:
bay.score(X_train, y_train)

0.78260869565217395

In [8]:
bay.score(X_test, y_test)

0.78109722976643126

##### Predicting with the test data shows that the classifier labels about 61.8% of the test emails as not spam. The full data set is known to be about 40% spam and 60% not spam. A predicted share of 61.8% not spam is close to that base rate, which is a decent result considering it is unknown how the two classes are split between the test and train groups.

In [9]:
test = bay.predict(X_test)

In [10]:
not_spam = 0
for item in test:
    if item == 0:
        not_spam += 1

In [11]:
not_spam = not_spam / len(test)
print("Emails not spam: {:.1%}".format(not_spam))

Emails not spam: 61.8%
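
A predicted share that is close to the base rate does not by itself show that individual emails are labeled correctly, because errors in the two directions can cancel out. The sketch below compares the predicted share with the actual share and counts each (actual, predicted) pair. The helper names `class_share` and `confusion_counts` and the ten labels are made up for illustration; in this notebook the inputs would be `y_test` and `test`.

```python
def class_share(labels, value=0):
    """Fraction of labels equal to `value`."""
    labels = list(labels)
    return sum(1 for label in labels if label == value) / len(labels)


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs for binary 0/1 labels."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


# Made-up labels for illustration: 0 = not spam, 1 = spam.
actual = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]
predicted = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]

print("Actual not spam: {:.1%}".format(class_share(actual)))
print("Predicted not spam: {:.1%}".format(class_share(predicted)))
print(confusion_counts(actual, predicted))
```

The off-diagonal entries, keys `(0, 1)` and `(1, 0)`, are the misclassified emails that a comparison of shares alone cannot reveal.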


##### Changing the train/test split to determine if a 60/40 split is best.

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

In [13]:
bay = MultinomialNB()
bay.fit(X_train, y_train)
bay.score(X_train, y_train)

0.79130434782608694

In [14]:
bay.score(X_test, y_test)

0.79148566463944392

In [15]:
test = bay.predict(X_test)
not_spam = 0
for item in test:
    if item == 0:
        not_spam += 1
        
not_spam = not_spam / len(test)
print("Emails not spam: {:.1%}".format(not_spam))

Emails not spam: 62.0%


##### Changing the train/test split to 75% training and 25% testing raises the score by about 0.01; however, the share of test emails predicted as not spam rises by less than one percentage point, to 62.0%. Again, it is known that about 60% of the emails in the entire data set are classified as not spam, but the exact composition of the testing data set is unknown.
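
One way to remove the uncertainty about the composition of the testing data is to stratify the split, so that the train and test groups keep the class proportions of the full data set. scikit-learn's `train_test_split` accepts a `stratify` argument for this. The sketch below uses synthetic data as a stand-in for the notebook's `X` and `y`; the names `X_demo`, `y_demo`, and `shares` and the 600/400 class counts are assumptions chosen only to mirror the 60/40 split described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption): 1000 rows, 60% labeled 0.
rng = np.random.RandomState(42)
X_demo = rng.randint(0, 10, size=(1000, 5))
y_demo = np.array([0] * 600 + [1] * 400)

shares = {}
for train_size in (0.6, 0.75):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, train_size=train_size, random_state=42, stratify=y_demo
    )
    # Share of the test labels that are 0 (not spam) under this split.
    shares[train_size] = (y_te == 0).mean()
    print("train_size={}: test share not spam {:.1%}".format(train_size, shares[train_size]))
```

With a stratified split, differences in the predicted share between the two train/test ratios can be attributed to the classifier rather than to chance differences in how the classes were divided.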