## Section 0: Download SMS Spam Dataset
We will be working with the [SMS Spam Dataset](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) provided by the UCI machine learning repository.

Fighting spam is a critical task for many mobile service providers. However, the task is challenging because of the complexity of natural language. We will apply natural language processing techniques to build a classifier that predicts whether an SMS message is spam.

Download two text files, `spam-train.csv` and `spam-test.csv`. `spam-train.csv` contains the text of SMS messages along with their labels (1 for spam, 0 for non-spam). `spam-test.csv` contains the text and labels for a separate set of SMS messages. Some example data points are listed below:
```
label	text
0		Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1		FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
0		As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
1		WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
1		Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030

```

In [None]:
from urllib.request import urlretrieve
urlretrieve('https://drive.google.com/uc?export=download&id=1VRKLGMIJZjGmSJQMCE_Ukn7WnY4M2q07',
            'spam-train.csv')
urlretrieve('https://drive.google.com/uc?export=download&id=1p4CBU3VSZOjiCeupL4UhIjbJb-bB-Ju5',
            'spam-test.csv')

('spam-test.csv', <http.client.HTTPMessage at 0x7fdee789d550>)

## Section 1: Import Data
Import both `spam-train.csv` and `spam-test.csv`. Create a DataFrame `spam_train` to store the data from `spam-train.csv` and create another DataFrame `spam_test` to store the data from `spam-test.csv`. Report how many SMS messages each csv file has.

In [None]:
import pandas as pd
# read spam-train.csv into a DataFrame; the printed summary ends with the row count
spam_train = pd.read_csv('spam-train.csv')
print(spam_train)

# read spam-test.csv into a DataFrame
spam_test = pd.read_csv('spam-test.csv')
print(spam_test)


      label                                               text
0         0                       Yep, by the pretty sculpture
1         0                            What you did in  leave.
2         0                   I have to take exam with march 3
3         1  Do you want 750 anytime any network mins 150 t...
4         0  All boys made fun of me today. Ok i have no pr...
...     ...                                                ...
1595      0  She said,'' do u mind if I go into the bedroom...
1596      0             No message..no responce..what happend?
1597      0  Set a place for me in your heart and not in yo...
1598      0  Thanx u darlin!im cool thanx. A few bday drink...
1599      0  You are not bothering me but you have to trust...

[1600 rows x 2 columns]
     label                                               text
0        0  Sorry da:)i was thought of calling you lot of ...
1        1  URGENT! Your Mobile No was awarded a £2,000 Bo...
2        0                       

## Section 2.1: Preprocess Text Data
In this section, we preprocess the SMS text data in both `spam-train.csv` and `spam-test.csv`. The code tokenizes each SMS message, lowercases each token, removes punctuation and stop words, and stems each remaining token.
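The steps described above can be sketched as a single function. This is a minimal illustration, not the notebook's own code: `preprocess`, `simple_tokenize`, `DEMO_STOP_WORDS`, and `strip_plural` are hypothetical names, and the last three are small stand-ins for `nltk.word_tokenize`, the NLTK English stop-word list, and `PorterStemmer.stem`, chosen so the example runs without downloading NLTK data. The sketch drops every token made up entirely of punctuation characters, which is stricter than a substring test against `string.punctuation`.

```python
import re
import string


def preprocess(msg, tokenize, stop_words, stem):
    """Tokenize, lowercase, drop punctuation-only tokens and stop words, then stem."""
    tokens = [tok.lower() for tok in tokenize(msg)]
    # Drop tokens that consist only of punctuation characters (e.g. "!", "...").
    tokens = [tok for tok in tokens
              if not all(ch in string.punctuation for ch in tok)]
    # Drop stop words.
    tokens = [tok for tok in tokens if tok not in stop_words]
    # Stem what is left and join back into one string per message.
    return " ".join(stem(tok) for tok in tokens)


# --- Stand-ins (assumptions for illustration only) ---
def simple_tokenize(text):
    # Splits into runs of word characters and runs of punctuation.
    return re.findall(r"\w+|[^\w\s]+", text)


DEMO_STOP_WORDS = {"i", "to", "the", "a", "you", "is"}


def strip_plural(token):
    # Toy "stemmer": removes a trailing "s" from longer tokens.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


example = preprocess("I have to take exams... NOW!",
                     simple_tokenize, DEMO_STOP_WORDS, strip_plural)
print(example)
```

In the notebook itself, the real NLTK callables could be passed instead, for example `preprocess(msg, nltk.word_tokenize, stop_words, ps.stem)`.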

In [None]:

train_msgs = spam_train.iloc[:, 1]
test_msgs = spam_test.iloc[:, 1]


## download the punkt module
import nltk
nltk.download('punkt')

## get punctuations
import string
punctuations = string.punctuation

## download stop words
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

## import stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
## iteratively process each SMS message in spam-train.csv
new_train_msgs = []
review_tokenized_list = []
review_lowercased = []
review_tokenized_lowercase_list = []
print('Result for nltk tokenization:')
for msg in train_msgs:
    # keep the raw tokens for inspection and a lowercased copy for the later steps
    review_tokenized_list.append(nltk.word_tokenize(msg))
    review_lowercased.append(msg.lower())


# tokenize the lowercased messages
for lowered_msg in review_lowercased:
    review_tokenized_lowercase_list.append(nltk.word_tokenize(lowered_msg))

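# remove punctuation tokens; `in string.punctuation` is a substring test, so
# single characters such as '!' are dropped while tokens such as '....' are kept
review_no_punctuations_words = []
review_no_punctuations_list = []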
for words in review_tokenized_lowercase_list:
    for word in words:
        if word not in string.punctuation:
            review_no_punctuations_words.append(word)
    review_no_punctuations_list.append(review_no_punctuations_words)
    review_no_punctuations_words = []

# remove stop words
stop_words = set(stopwords.words('english'))
review_no_stopwords = []
review_no_stopwords_list = []
for tokens in review_no_punctuations_list:
    for word in tokens:
        if word not in stop_words:
            review_no_stopwords.append(word)
    review_no_stopwords_list.append(review_no_stopwords)
    review_no_stopwords = []

# stem each remaining token
review_stemmed = []
review_stemmed_list_train = []
for tokens in review_no_stopwords_list:
    for word in tokens:
        review_stemmed.append(ps.stem(word))
    review_stemmed_list_train.append(review_stemmed)
    review_stemmed = []

### join the stemmed tokens of each message back into a single string
final_string_train = []
for tokens in review_stemmed_list_train:
    final_string_train.append(" ".join(tokens))


print(review_tokenized_list)
print(review_lowercased)
print(review_tokenized_lowercase_list)
print(review_no_punctuations_list)
print(review_no_stopwords_list)
print(review_stemmed_list_train)
print(final_string_train)

Result for nltk tokenization:
[['yep', 'pretti', 'sculptur'], ['leav'], ['take', 'exam', 'march', '3'], ['want', '750', 'anytim', 'network', 'min', '150', 'text', 'new', 'video', 'phone', 'five', 'pound', 'per', 'week', 'call', '08000776320', 'repli', 'deliveri', 'tomorrow'], ['boy', 'made', 'fun', 'today', 'ok', 'problem', 'sent', 'one', 'messag', 'fun'], ['ok', '....', 'take', 'care.umma', '...'], ['o.', 'well', 'uv', 'caus', 'mutat', 'sunscreen', 'like', 'essenti', 'theseday'], ['free', 'top', 'rington', '-sub', 'weekli', 'ringtone-get', '1st', 'week', 'free-send', 'subpoli', '81618-', '3', 'per', 'week-stop', 'sms-08718727870'], ['hey', 'gal', '...', 'u', 'wan', 'na', 'meet', '4', 'dinner', 'nìte'], ["n't", 'come', 'home', 'class', 'right', 'need', 'work', 'shower'], ['good', 'word', '....', 'word', 'may', 'leav', 'u', 'dismay', 'mani', 'time'], ['mark', 'work', 'tomorrow', 'get', '5.', 'work', 'hous', 'meet', 'u', 'afterward'], ['prasanth', 'ettan', 'mother', 'pass', 'away', 'last

In [None]:
## iteratively process each SMS message in spam-test.csv
new_test_msgs = []
review_tokenized_list = []
review_lowercased = []
review_tokenized_lowercase_list = []
print('Result for nltk tokenization:')
for msg in test_msgs:
    # process each message in spam-test.csv: keep raw tokens and a lowercased copy
    review_tokenized_list.append(nltk.word_tokenize(msg))
    review_lowercased.append(msg.lower())

# tokenize the lowercased messages
for lowered_msg in review_lowercased:
    review_tokenized_lowercase_list.append(nltk.word_tokenize(lowered_msg))

# remove punctuation tokens (same substring test as for the training messages)
review_no_punctuations_words = []
review_no_punctuations_list = []
for words in review_tokenized_lowercase_list:
    for word in words:
        if word not in string.punctuation:
            review_no_punctuations_words.append(word)
    review_no_punctuations_list.append(review_no_punctuations_words)
    review_no_punctuations_words = []

# remove stop words
stop_words = set(stopwords.words('english'))
review_no_stopwords = []
review_no_stopwords_list = []
for tokens in review_no_punctuations_list:
    for word in tokens:
        if word not in stop_words:
            review_no_stopwords.append(word)
    review_no_stopwords_list.append(review_no_stopwords)
    review_no_stopwords = []

# stem each remaining token
review_stemmed = []
review_stemmed_list_test = []
for tokens in review_no_stopwords_list:
    for word in tokens:
        review_stemmed.append(ps.stem(word))
    review_stemmed_list_test.append(review_stemmed)
    review_stemmed = []

### join the stemmed tokens of each message back into a single string
final_string_test = []
for tokens in review_stemmed_list_test:
    final_string_test.append(" ".join(tokens))



print(review_tokenized_list)
print(review_lowercased)
print(review_tokenized_lowercase_list)
print(review_no_punctuations_list)
print(review_no_stopwords_list)
print(review_stemmed_list_test)
print(final_string_test)

Result for nltk tokenization:
[['Sorry', 'da', ':', ')', 'i', 'was', 'thought', 'of', 'calling', 'you', 'lot', 'of', 'times', ':', ')', 'lil', 'busy.i', 'will', 'call', 'you', 'at', 'noon', '..'], ['URGENT', '!', 'Your', 'Mobile', 'No', 'was', 'awarded', 'a', '£2,000', 'Bonus', 'Caller', 'Prize', 'on', '1/08/03', '!', 'This', 'is', 'our', '2nd', 'attempt', 'to', 'contact', 'YOU', '!', 'Call', '0871-4719-523', 'BOX95QU', 'BT', 'National', 'Rate'], ['U', 'can', 'call', 'now', '...'], ['Then', 'we', 'wait', '4', 'u', 'lor', '...', 'No', 'need', '2', 'feel', 'bad', 'lar', '...'], ['WIN', ':', 'We', 'have', 'a', 'winner', '!', 'Mr.', 'T.', 'Foley', 'won', 'an', 'iPod', '!', 'More', 'exciting', 'prizes', 'soon', ',', 'so', 'keep', 'an', 'eye', 'on', 'ur', 'mobile', 'or', 'visit', 'www.win-82050.co.uk'], ['I', 'cant', 'wait', 'to', 'see', 'you', '!', 'How', 'were', 'the', 'photos', 'were', 'useful', '?', ':', ')'], ['Ya', 'very', 'nice.', '.', '.be', 'ready', 'on', 'thursday'], ['Check', 'wit

## Section 2.2: Convert Text Data with TF-IDF
After data preprocessing, the next step is to convert each SMS message into a numerical vector using TF-IDF.
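To make the TF-IDF weighting concrete, here is a small pure-Python sketch of the default scheme used by scikit-learn's `TfidfVectorizer`: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf_vectors` and the three toy messages are made up for illustration. As a simplification, it splits on whitespace and keeps every term, whereas the vectorizer applies its own token pattern and, with `max_features=500`, keeps only the 500 most frequent terms.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF, as in TfidfVectorizer's default settings.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid division by zero
        rows.append([v / norm for v in raw])
    return vocab, rows


# Toy, already-preprocessed messages (hypothetical data).
vocab, vectors = tfidf_vectors(["free prize call", "call home", "free free prize"])
for row in vectors:
    print([round(v, 3) for v in row])
```

In the second toy message, `home` receives a larger weight than `call` because it appears in fewer documents.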


In [None]:
# use TfidfVectorizer to extract a TF-IDF vector for each message
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_transformer = TfidfVectorizer(max_features=500)

# fit the vocabulary on the training messages only, then transform both sets with it
tfidf_transformer.fit(final_string_train)
train_features = tfidf_transformer.transform(final_string_train).toarray()
test_features = tfidf_transformer.transform(final_string_test).toarray()

print(train_features)
print(test_features)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.29493455 0.29493455 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


## Section 3: Apply Machine Learning to Combat Spam Messages
In this section, we train a logistic regression model and a Naive Bayes classifier to predict whether a given SMS message is spam. Specifically, we use the TF-IDF vectors of the training dataset (`train_features`) to train each classifier and the TF-IDF vectors of the test dataset (`test_features`) for prediction. Once prediction is finished, we report the F1 score and AUC-ROC on the test dataset for both classifiers, along with accuracy, recall, and precision for the Naive Bayes classifier.
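The two metrics measure different things: F1 is computed from hard 0/1 predictions, while AUC-ROC is computed from the predicted scores and reflects how well spam is ranked above non-spam. The sketch below illustrates the ranking interpretation of AUC-ROC in plain Python, as the fraction of (spam, non-spam) pairs in which the spam message gets the higher score, with ties counted as half. The function name `auc_roc` and the toy labels and scores are assumptions for illustration; this is not how `roc_auc_score` is implemented internally.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data: 1 = spam, 0 = non-spam (hypothetical scores).
demo_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_auc)
```

A classifier can therefore have a high AUC-ROC and a lower F1 when its scores rank messages well but the decision threshold used for the hard predictions is not well placed.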





In [None]:
"""
 Logistic Regression
"""
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score

log_clf = LogisticRegression()
log_clf.fit(train_features, spam_train['label'])        # fit the model
log_clf_pred = log_clf.predict(test_features)           # make predictions
log_clf_score = log_clf.predict_proba(test_features)    # class probabilities; column 1 is P(label == 1)

## F1 score
log_clf_f1 = f1_score(spam_test['label'], log_clf_pred)
print('Prediction F1: {:.4f}'.format(log_clf_f1))

## AUC-ROC
log_clf_auc = roc_auc_score(spam_test['label'], log_clf_score[:, 1])
print('AUC-ROC : {:.4f}'.format(log_clf_auc))
print()

# Logistic regression features
log_clf_coef = pd.DataFrame({
    'Feature Name': tfidf_transformer.get_feature_names_out(),
    'Coefficient': log_clf.coef_[0]
})
print(log_clf_coef)

Prediction F1: 0.7963
AUC-ROC : 0.9877

    Feature Name  Coefficient
0            000     0.800577
1             03     0.754461
2    08000930705     0.571457
3             10     1.132362
4            100     0.722262
..           ...          ...
495    yesterday    -0.142622
496          yet    -0.297686
497           yo    -0.358428
498           yr     0.198975
499          yup    -0.290493

[500 rows x 2 columns]


In [None]:
'''
Naive Bayes Model
'''
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score, \
    precision_score, f1_score, roc_auc_score

# fit a Gaussian Naive Bayes model on the dense TF-IDF features
gnb = GaussianNB()
gnb.fit(train_features, spam_train['label'])                  # fit the model
gnb_pred_t = gnb.predict(test_features)                       # make predictions
gnb_score_t = gnb.predict_proba(test_features)                # get prediction scores

## accuracy
gnb_acc = accuracy_score(spam_test['label'], gnb_pred_t)
print('Prediction accuracy: {:.4f}'.format(gnb_acc))

## recall
gnb_recall = recall_score(spam_test['label'], gnb_pred_t)
print('Prediction recall: {:.4f}'.format(gnb_recall))

## precision
gnb_precision = precision_score(spam_test['label'], gnb_pred_t)
print('Prediction precision: {:.4f}'.format(gnb_precision))

## F1 score
gnb_f1 = f1_score(spam_test['label'], gnb_pred_t)
print('Prediction F1: {:.4f}'.format(gnb_f1))

## AUC-ROC
gnb_auc = roc_auc_score(spam_test['label'], gnb_score_t[:, 1])
print('AUC-ROC : {:.4f}'.format(gnb_auc))
print()

Prediction accuracy: 0.7375
Prediction recall: 0.9508
Prediction precision: 0.3625
Prediction F1: 0.5249
AUC-ROC : 0.8248
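The Naive Bayes numbers above can be cross-checked by hand, since F1 is the harmonic mean of precision and recall. The helper `f1_from` below is a hypothetical name used only for this check; the inputs are the rounded precision and recall printed above, so the result matches the reported F1 only up to rounding.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the Naive Bayes output above.
gnb_f1_check = f1_from(precision=0.3625, recall=0.9508)
print('F1 from precision and recall: {:.4f}'.format(gnb_f1_check))
```

The check shows why F1 is low despite the high recall: the harmonic mean is pulled toward the smaller of the two values, here the precision of 0.3625.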

