# Let's build a spam classifier

We will use data from the `SMS Spam Collection v. 1`, described as:

> a public set of SMS labeled messages that have been collected for mobile phone spam research. It has one collection composed by 5,574 English, real and non-enconded messages, tagged according being legitimate (ham) or spam.

([source](http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/))

#### Load useful libraries and data

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [None]:
# Load data
data = pd.read_csv(
    "./data/SMSSpamCollection.txt",
    encoding="utf-8",
    header=None,
    delimiter="\t",
    names=["target", "text"],
)

# Encoding target variable
data["target"] = np.where(data["target"] == "spam", 1, 0)

In [None]:
data.sample(3)

## A quick look at the data

In [None]:
print("Dataset contains {} instances of {} variables.".format(data.shape[0], data.shape[1]))

print(
    "It contains {} spam messages ({:.1%} of all)".format(
        data[data.target == 1].shape[0],
        data[data.target == 1].shape[0] / data.shape[0],
    )
)

In [None]:
print(
    "Examples of spam SMS: \n    {}\n    {}".format(
        data[data.target == 1].sample(1).text.iloc[0],
        data[data.target == 1].sample(1).text.iloc[0],
    )
)
print(
    "\nExamples of non-spam SMS: \n    {}\n    {}".format(
        data[data.target == 0].sample(1).text.iloc[0],
        data[data.target == 0].sample(1).text.iloc[0],
    )
)

## Spam classification

Here we build a "vanilla" classifier, without putting much thought into what the actual messages, spam or not, look like. We investigate more carefully in the notebook `part_3_toward_improving_spam_classifier`.

In [None]:
# Split dataset between train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["target"], random_state=0
)
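By default, `train_test_split` holds out 25% of the rows for testing. Because spam is a minority class here, one optional variation (not used in this notebook) is to pass `stratify` so that both splits keep the same class ratio. The sketch below uses made-up labels, and the `toy_` names are illustrative only:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 20 "spam" (1) and 80 "ham" (0)
toy_y = np.array([1] * 20 + [0] * 80)
toy_X = np.arange(len(toy_y))

# stratify=toy_y asks for the same class ratio in train and test
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=0
)

print("Test set size = {}".format(len(toy_y_test)))
print("Spam share in train = {:.2f}".format(toy_y_train.mean()))
print("Spam share in test = {:.2f}".format(toy_y_test.mean()))
```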

### CountVectorizer

In [None]:
# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)

# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)
print("X_train_vectorized: ")
X_train_vectorized

In [None]:
print("X_train shape = {}".format(X_train.shape))
print("Vocabulary length = {}".format(len(vect.vocabulary_)))
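To make the document-term matrix more concrete, here is a minimal sketch on two made-up messages (they are not from the dataset, and the `toy_` names are illustrative only). Each column corresponds to one word of the vocabulary, and each cell counts how often that word occurs in the message:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, for illustration only
toy_docs = ["free prize call now", "call me now now"]

toy_vect = CountVectorizer().fit(toy_docs)
toy_matrix = toy_vect.transform(toy_docs)

# vocabulary_ maps each word to its column index; sort words by that index
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)

print(toy_vocab)
print(toy_matrix.toarray())
```

The second row contains a 2 in the `now` column because that word appears twice in the second message.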

In [None]:
# Train the model
model = LogisticRegression(max_iter=1500)
model.fit(X_train_vectorized, y_train)

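# Score the transformed test documents with the predicted spam probability:
# ROC AUC ranks scores, and hard 0/1 labels would discard that ranking
spam_scores = model.predict_proba(vect.transform(X_test))[:, 1]

print("AUC = {:.3f}".format(roc_auc_score(y_test, spam_scores)))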
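ROC AUC is a ranking metric: it measures how often a spam message receives a higher score than a legitimate one. When it is given hard 0/1 labels instead of continuous scores, it collapses to the balanced accuracy at a single threshold. A toy sketch with made-up labels and scores (not taken from the dataset) illustrates the gap:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and classifier scores, for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.8])

# Hard labels obtained with a 0.5 threshold
labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, labels)

print("AUC from scores = {:.2f}".format(auc_from_scores))
print("AUC from labels = {:.2f}".format(auc_from_labels))
```

Both positive examples outrank both negative ones, so the score-based AUC is 1.00. The 0.5 threshold misses one positive, which lowers the label-based value to 0.75.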

In [None]:
# Get the feature names as a numpy array
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
feature_names = np.array(vect.get_feature_names_out())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1]
# so the list returned is in order of largest to smallest
print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))
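The `argsort` and slicing idiom used above can be checked on a small array. This sketch uses made-up coefficients and feature names, and selects the 2 smallest and 2 largest values instead of 10:

```python
import numpy as np

# Made-up coefficients and feature names, for illustration only
toy_coefs = np.array([0.5, -2.0, 3.0, 0.0, -1.0])
toy_names = np.array(["a", "b", "c", "d", "e"])

# argsort returns the indices that sort the coefficients in ascending order
toy_order = toy_coefs.argsort()

# [:2] keeps the 2 smallest; [:-3:-1] walks backwards from the end,
# so it returns the 2 largest, from largest to smallest
toy_smallest = toy_names[toy_order[:2]]
toy_largest = toy_names[toy_order[:-3:-1]]

print("Smallest: {}".format(toy_smallest))
print("Largest: {}".format(toy_largest))
```

Words with the smallest (most negative) coefficients push the prediction toward non-spam, and words with the largest coefficients push it toward spam.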

### TF-IDF

In [None]:
# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 3
vect = TfidfVectorizer(min_df=3).fit(X_train)

# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)
X_train_vectorized
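A small sketch shows what `min_df` and the inverse document frequency do. It uses three made-up messages and `min_df=2` rather than 3 so that the effect is visible on such a tiny corpus. With scikit-learn's default settings, the IDF of a word is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` the number of documents containing the word:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up messages, for illustration only
toy_docs = ["free prize now", "free call now", "see you now"]

# Keep only words that appear in at least 2 documents
toy_tfidf = TfidfVectorizer(min_df=2).fit(toy_docs)

# vocabulary_ maps each word to its column index; sort words by that index
toy_vocab = sorted(toy_tfidf.vocabulary_, key=toy_tfidf.vocabulary_.get)
toy_idf = dict(zip(toy_vocab, toy_tfidf.idf_))

print(toy_vocab)
for word, idf in toy_idf.items():
    print("{}: idf = {:.3f}".format(word, idf))
```

Only `free` and `now` survive the `min_df` filter. `now` appears in every message, so it gets the minimum IDF of 1, while the rarer `free` gets a higher weight.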

In [None]:
# Train the model
model = LogisticRegression(max_iter=1500)
model.fit(X_train_vectorized, y_train)

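# Score the transformed test documents with the predicted spam probability:
# ROC AUC ranks scores, and hard 0/1 labels would discard that ranking
spam_scores = model.predict_proba(vect.transform(X_test))[:, 1]

print("AUC = {:.3f}".format(roc_auc_score(y_test, spam_scores)))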

In [None]:
feature_names = np.array(vect.get_feature_names_out())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print("Smallest tfidf:\n{}\n".format(feature_names[sorted_tfidf_index[:10]]))
print("Largest tfidf: \n{}".format(feature_names[sorted_tfidf_index[:-11:-1]]))

In [None]:
# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1]
# so the list returned is in order of largest to smallest
print("Smallest Coefs:\n{}\n".format(feature_names[sorted_coef_index[:10]]))
print("Largest Coefs: \n{}".format(feature_names[sorted_coef_index[:-11:-1]]))

## Testing our spam classifier

In [None]:
# Your input below
# input_text = "write something here"

# Or use an example from the test set
input_text = X_test.sample(1).iloc[0]
input_text

In [None]:
if model.predict(vect.transform([input_text]))[0] == 1:
    print('This is spam!')
else:
    print('Not spam :)')
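The vectorizer and the model always have to be applied together, so one possible refinement is to bundle them in a scikit-learn `Pipeline`. The sketch below is an assumption about how this could look, not part of the notebook's workflow: it is trained on a handful of made-up messages, and `spam_pipeline` and `spam_probability` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training messages, for illustration only (1 = spam, 0 = ham)
toy_texts = [
    "win a free prize now",
    "free cash claim your prize",
    "urgent claim your free reward",
    "see you at lunch",
    "are we meeting tomorrow",
    "call me when you get home",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes raw text and then applies the classifier
spam_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1500))
spam_pipeline.fit(toy_texts, toy_labels)


def spam_probability(text):
    """Return the predicted probability that `text` is spam."""
    return spam_pipeline.predict_proba([text])[0, 1]


p_spam_like = spam_probability("claim your free prize")
p_ham_like = spam_probability("see you at lunch tomorrow")
print("Spam-like message: {:.2f}".format(p_spam_like))
print("Ham-like message: {:.2f}".format(p_ham_like))
```

With a pipeline, raw text goes straight into `predict` or `predict_proba`, which removes the risk of forgetting the `vect.transform` step or pairing a model with the wrong vectorizer.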