# Spam Detection
## Project Purpose

This dataset contains text from emails which are marked as spam (1) or not spam (0). The purpose of this mini project is to build a machine learning model that can identify if an email is spam or not.

This project uses scikit-learn to explore tokenization, vectorization, and statistical classification algorithms.

`CountVectorizer()` transforms a body of text into a sparse matrix of word counts, with one row per document and one column per vocabulary word, that can be passed to ML algorithms. This applies the Bag of Words method, as word order is not taken into account.
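As a small illustration of the idea, the sketch below runs `CountVectorizer` on three made-up sentences (they are not taken from the dataset). The first two contain the same words in a different order, so they should map to the same count vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, not rows from emails.csv
docs = ["free money now", "money free now", "free free offer"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# Vocabulary words ordered by their column index in the matrix
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()

print(vocab)
print(dense)
```

Rows 0 and 1 come out identical because only the counts are kept, which is exactly the information a Bag of Words representation discards about word order.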

### Initialize Project

In [5]:
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import classification_report, accuracy_score

In [6]:
# Create df from spam emails csv
df = pd.read_csv('emails.csv')

df.head()

| | text | spam |
|---|---|---|
| 0 | Subject: naturally irresistible your corporate... | 1 |
| 1 | Subject: the stock trading gunslinger fanny i... | 1 |
| 2 | Subject: unbelievable new homes made easy im ... | 1 |
| 3 | Subject: 4 color printing special request add... | 1 |
| 4 | Subject: do not have money , get software cds ... | 1 |


### Training and Testing Split

In [7]:
# Learn the vocabulary and build the document-term count matrix
text_vec = CountVectorizer().fit_transform(df['text'])

# Use the sparse matrix text_vec as X and the df['spam'] column as y
# Shuffle and hold out 45% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    text_vec,
    df['spam'],
    test_size=0.45,
    random_state=42,
    shuffle=True,
)
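
The cell above fits the vectorizer on every email before splitting, so words that occur only in test emails still end up in the vocabulary. A possible alternative, sketched here on a made-up toy corpus (the `texts` and `labels` lists are hypothetical stand-ins for `df['text']` and `df['spam']`), is to split the raw text first and fit the vectorizer on the training portion only. This is not what the notebook does, and the scores reported below were not produced this way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for df['text'] and df['spam']
texts = [
    "win free money now", "cheap software cds offer",
    "claim your free prize", "stock tips guaranteed profit",
    "meeting moved to noon", "please review the attached report",
    "lunch on friday", "quarterly schedule update",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, using the same settings as the notebook
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.45, random_state=42, shuffle=True
)

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(text_train)  # vocabulary from training text only
X_test = vectorizer.transform(text_test)        # words unseen in training are ignored

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns, so a classifier fitted on `X_train` can be applied to `X_test` directly.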

### Classifier - Gradient Boosting

In [8]:
classifier = ensemble.GradientBoostingClassifier(
    n_estimators=100,   # number of boosting stages (decision trees)
    learning_rate=0.5,  # shrinks the contribution of each tree
    max_depth=6,        # maximum depth of each individual tree
)

### Generate Predictions

In [9]:
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      1942
           1       0.98      0.91      0.95       636

   micro avg       0.97      0.97      0.97      2578
   macro avg       0.98      0.95      0.96      2578
weighted avg       0.97      0.97      0.97      2578



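The classifier reaches about 97% accuracy on the 2,578 held-out test emails. Recall for the spam class is lower, at 0.91, so roughly 9% of spam emails are missed.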
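
The accuracy figure can be cross-checked from the report itself: recall times support approximates the number of correctly classified emails in each class. The sketch below redoes that arithmetic with the rounded values printed above, so it is only approximate. On the real data, `accuracy_score(y_test, predictions)`, which the notebook imports but does not call, would give the exact value.

```python
# Recall and support per class, copied from the classification report
report = {
    0: {"recall": 0.99, "support": 1942},  # not spam
    1: {"recall": 0.91, "support": 636},   # spam
}

total = sum(row["support"] for row in report.values())
# Approximate count of correct predictions (recall values are rounded)
correct = sum(row["recall"] * row["support"] for row in report.values())
accuracy = correct / total

print(total)
print(f"{accuracy:.2f}")
```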

Note to self: try other classifiers, tune the hyperparameters further, and try different vectorizers.
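
One way to start on that note is sketched below. It is an assumption about how the comparison could be set up, not something the notebook ran: the toy `texts` and `labels` are made up, and the two candidate pipelines (`MultinomialNB` with `CountVectorizer`, `LogisticRegression` with `TfidfVectorizer`) are just examples of alternatives available in scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; with the real data use df['text'] and df['spam']
texts = [
    "win free money now", "cheap software cds offer",
    "claim your free prize", "stock tips guaranteed profit",
    "meeting moved to noon", "please review the attached report",
    "lunch on friday", "quarterly schedule update",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Each pipeline bundles a vectorizer with a classifier
candidates = {
    "count + naive bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "tfidf + logistic regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

new_emails = ["win free money", "meeting agenda attached"]
results = {}
for name, model in candidates.items():
    model.fit(texts, labels)  # the vectorizer is fitted inside the pipeline
    results[name] = model.predict(new_emails).tolist()
    print(name, results[name])
```

With the real dataset, each pipeline would be fitted on the training text and scored on the held-out text with `classification_report`, so the candidates can be compared against the gradient boosting results above.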