spam

Goal: Train a model to detect spam based on the Spambase Data Set (https://archive.ics.uci.edu/ml/datasets/Spambase).

Run

To run self-contained:

pip3 install -r requirements.txt
python3 run.py

Or load spam.ipynb in ipython/jupyter notebook. This includes a small view of the dataset to get a feel for its complexity.

Method

This analysis has the following core steps:

read in and prepare dataset
possibly apply dimensionality reduction (transform data)
split data into training and testing sets
train model on training set
test model by comparing expected and predicted results from the testing set
compute accuracy and error
pick winner

In this case we do not only want to use one algorithm, instead we use a Naive Bayes classifier and a Support Vector Machine (SVM). These are selected as they tend to perform well (but differently) on this type of (non-categorical) dataset. We also compare two decomposition methods: Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA). We compare the results when using all features of the data, as well as when only using the best and top ten decomposed features. This will tell us if all the features present in the dataset are actually predictive of the output. In total we compare 10 combinations.

For each combination run 15 iterations of training and testing, using a different subsets of the dataset (cross-validation). This ensures we don't accidentally pick a very homogeneous subset that doesn't support the features of the whole dataset. We use 80% of the data for training and 20% for subsequent testing.

Results

The best result for the combinations given is an accuracy of 0.913 with an error of 0.006. This is achieved using an LDA to reduce the dimensionality with 10 components and an SVM to train the classification model. Using 1 component is only marginally less accurate, this also depends on the seed, so results may vary slightly between runs. Generally, the Naive Bayes algorithm performs worse than the SVM but is trained much faster.

Next steps

With the current setup, it is easy to add further classification algorithms and decompositions. Examples are k-Nearest Neighbour, Decision Trees, Random Forests. Similary more combinations of components could be added by setting up an automatic grid search over the parameters.

Appendix

On my machine, the following results are obtained

# Using all components.
## svm:
### Accuracy of 0.8373507057546146
### Error 0.003962395907138027
### Confusion Matrix
[[ 0.51646761  0.08975751]
 [ 0.07289178  0.3208831 ]]


## bayes:
### Accuracy of 0.8243937748823743
### Error 0.007406228497168452
### Confusion Matrix
[[ 0.44618169  0.15888527]
 [ 0.01672096  0.37821209]]


# Using pca
# Using 1 components.
## svm:
### Accuracy of 0.6944625407166125
### Error 0.005924731296883787
### Confusion Matrix
[[ 0.4723127   0.13637351]
 [ 0.16916395  0.22214984]]


## bayes:
### Accuracy of 0.6542164314151285
### Error 0.002000689269519709
### Confusion Matrix
[[ 0.58487152  0.02272892]
 [ 0.32305465  0.06934491]]


# Using lda
# Using 1 components.
## svm:
### Accuracy of 0.9109663409337676
### Error 0.004394348618690526
### Confusion Matrix
[[ 0.56728194  0.03923272]
 [ 0.04980094  0.3436844 ]]


## bayes:
### Accuracy of 0.8984437205935577
### Error 0.005735476906523811
### Confusion Matrix
[[ 0.57690916  0.02909881]
 [ 0.07245747  0.32153456]]


# Using pca
# Using 10 components.
## svm:
### Accuracy of 0.8014477017734347
### Error 0.016226206152889725
### Confusion Matrix
[[ 0.50503076  0.10025335]
 [ 0.09829895  0.29641694]]


## bayes:
### Accuracy of 0.6728193992037642
### Error 0.004899393765058115
### Confusion Matrix
[[ 0.58733261  0.01838581]
 [ 0.30879479  0.08548679]]


# Using lda
# Using 10 components.
## svm:
### Accuracy of 0.9130655085052479
### Error 0.006490451023549525
### Confusion Matrix
[[ 0.56293883  0.04010134]
 [ 0.04683315  0.35012667]]


## bayes:
### Accuracy of 0.8942453854505973
### Error 0.005786405828846596
### Confusion Matrix
[[ 0.57032211  0.03279045]
 [ 0.07296417  0.32392327]]


##############################
Best performing combination
Algorithm: svm
Components: 10
Decomposition: lda
Accuracy: 0.913066
Error: 0.006490

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
data/spambase		data/spambase
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
helper.py		helper.py
requirements.txt		requirements.txt
run.py		run.py
spam.ipynb		spam.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data/spambase

data/spambase

.gitignore

.gitignore

README.md

README.md

init.py

init.py

helper.py

helper.py

requirements.txt

requirements.txt

run.py

run.py

spam.ipynb

spam.ipynb

Repository files navigation

spam

Run

Method

Results

Next steps

Appendix

About

Releases

Packages

Languages

rhngit/spam

Folders and files

Latest commit

History

Repository files navigation

spam

Run

Method

Results

Next steps

Appendix

About

Resources

Stars

Watchers

Forks

Languages