In this notebook, I explore the spam detection problem using the Naive Bayes classifier in `sklearn`. A `CountVectorizer` with default parameters builds the bag of words. Finally, I evaluate the classifier by reporting accuracy, precision, recall, and F1 score.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


Load the data from the CSV file and clean it by keeping only the first two columns (`v1` and `v2`) and renaming them.

In [None]:
df = pd.read_csv('../input/spam.csv', encoding='latin-1')
df = df[['v1', 'v2']]
df.columns = ['label', 'sms_message']
df.head()

Map the text labels to numbers (`ham` to 0, `spam` to 1) and print the shape to see the size of the data.

In [None]:
df['label'] = df.label.map({'ham':0, 'spam':1})
print(df.shape)
df.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state = 1)
print("Number of rows in the original set: {}".format(df.shape[0]))
print("Number of rows in the training set: {}".format(X_train.shape[0]))
print("Number of rows in the test set: {}".format(X_test.shape[0]))

`CountVectorizer` builds the bag of words from our data set.
Note that it has several parameters we can experiment with to see their impact on the final performance.

In [None]:
# Instantiate the CountVectorizer
count_vector = CountVectorizer()
print(count_vector)
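
To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of what the vectorizer roughly does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary, and count. The two messages and the helper names `tokenize` and `to_counts` are made up for illustration; this is an approximation, not the actual `sklearn` implementation.

```python
import re
from collections import Counter

# Two made-up messages, used only for illustration.
docs = ["Win a FREE prize now", "Are you free now or later"]

# Approximation of the default tokenization: lowercase the text and
# keep tokens made of two or more word characters (so "a" is dropped).
token_pattern = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return token_pattern.findall(text.lower())


# Vocabulary: every distinct token across all messages, sorted alphabetically.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})


def to_counts(text):
    """Return one row of word counts, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


matrix = [to_counts(doc) for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one message and each column is one vocabulary word, which is the same layout as the matrix that `fit_transform` returns (there in sparse form).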

In [None]:
# Fit the vectorizer on the training data and return the count matrix
training_data = count_vector.fit_transform(X_train)

# Transform the testing data with the fitted vocabulary and return the matrix.
# Note that we do not fit the CountVectorizer on the testing data.
testing_data = count_vector.transform(X_test)
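
The reason for fitting only on the training data can be seen on a toy example: the vocabulary is fixed by `fit_transform`, and words that appear only in the test messages are simply ignored by `transform`. The messages below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, for illustration only.
train_docs = ["free prize now", "see you later"]
test_docs = ["free lunch later"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_docs)        # reuses that vocabulary

# "lunch" never appeared in the training messages, so it has no column.
print(sorted(vectorizer.vocabulary_))
print(test_matrix.toarray())
```

Both matrices have the same number of columns, which is what lets the classifier trained on one be applied to the other.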

In [None]:
# Create the Naive Bayes classifier and fit it on the training set
naive_bayes = MultinomialNB()
naive_bayes.fit(training_data, y_train)
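
As a rough picture of what the fit step computes, here is a small pure-Python sketch of multinomial Naive Bayes with Laplace smoothing (`alpha = 1.0`, the `MultinomialNB` default). The count rows, the column meanings, and the helper names `fit` and `predict` are assumptions made up for this illustration; it is not the `sklearn` implementation.

```python
import math

# Toy word-count rows (columns: "free", "prize", "later"); made-up data.
X = [[2, 1, 0],   # spam
     [1, 2, 0],   # spam
     [0, 0, 2],   # ham
     [1, 0, 1]]   # ham
y = [1, 1, 0, 0]  # 1 = spam, 0 = ham
alpha = 1.0       # Laplace smoothing


def fit(X, y, alpha):
    """Estimate log priors and smoothed per-class log word probabilities."""
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        totals = [sum(col) for col in zip(*rows)]  # word counts within class c
        denom = sum(totals) + alpha * n_features
        log_likelihood[c] = [math.log((t + alpha) / denom) for t in totals]
    return log_prior, log_likelihood


def predict(row, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(row, log_likelihood[c]))
        for c in log_prior
    }
    return max(scores, key=scores.get)


log_prior, log_likelihood = fit(X, y, alpha)
print(predict([1, 1, 0], log_prior, log_likelihood))  # message with "free prize"
print(predict([0, 0, 1], log_prior, log_likelihood))  # message with "later"
```

The smoothing term keeps a word that never occurred in a class (such as "later" in the toy spam rows) from forcing that class's probability to zero.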

In [None]:
predictions = naive_bayes.predict(testing_data)
print('Accuracy score: {}'.format(accuracy_score(y_test, predictions)))
print('Precision score: {}'.format(precision_score(y_test, predictions)))
print('Recall score: {}'.format(recall_score(y_test, predictions)))
print('F1 score: {}'.format(f1_score(y_test, predictions)))
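
For reference, the four scores can be computed by hand from the confusion counts. The sketch below does so on made-up labels (1 = spam, 0 = ham), treating spam as the positive class; the label lists are illustrative and unrelated to the data set above.

```python
# Made-up true and predicted labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam correctly flagged
tn = sum(t == 0 and p == 0 for t, p in pairs)  # ham correctly passed
fp = sum(t == 0 and p == 1 for t, p in pairs)  # ham wrongly flagged as spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam that slipped through

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy score: {}'.format(accuracy))
print('Precision score: {}'.format(precision))
print('Recall score: {}'.format(recall))
print('F1 score: {}'.format(f1))
```

Precision answers "of the messages flagged as spam, how many really were spam", while recall answers "of the real spam, how much was caught". Since spam data sets are often imbalanced, these two are worth reading alongside accuracy.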

With the Naive Bayes classifier provided by `sklearn`, a high-performance classifier for SMS spam detection can be built easily. Possible directions for more in-depth research:
1. Experiment with different parameters in `CountVectorizer`.
2. Create my own bag of words from scratch and compare it with the default `CountVectorizer`.
3. Experiment with different classifiers.
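
As a starting point for the first item, the sketch below compares how a few `CountVectorizer` settings change the vocabulary size on two made-up messages. The settings chosen (`stop_words`, `ngram_range`) are just examples of parameters that could be tried; their effect on the classifier's scores would still need to be measured on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, for illustration only.
docs = ["Win a free prize now", "Are you free now or later"]

# A few parameter settings to compare; the names are only labels.
settings = {
    "default": {},
    "english stop words": {"stop_words": "english"},
    "unigrams + bigrams": {"ngram_range": (1, 2)},
}

sizes = {}
for name, params in settings.items():
    vectorizer = CountVectorizer(**params)
    vectorizer.fit(docs)
    sizes[name] = len(vectorizer.vocabulary_)
    print('{}: {} features'.format(name, sizes[name]))
```

Removing stop words shrinks the vocabulary, while adding bigrams grows it. Whether either helps spam detection is an empirical question for the experiment.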