The deadline is 9:30am Feb 9th (Wed).   
You should submit a `.ipynb` file with your solutions to BrightSpace.

--- 

There are 10 extra points for "adding extra features to your model". But the maximum grade you can obtain in this homework is 100%. If you complete the extra-credit task, your score will be min{10+score, 100}.

---


In this homework we will preprocess the SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection.

## Data Loading (10 points)

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and was uploaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [1]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2022-02-08 18:57:59--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 142.250.141.102, 142.250.141.113, 142.250.141.139, ...
Connecting to docs.google.com (docs.google.com)|142.250.141.102|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/1b13231tbg8cgsg1pnql77plja3l2ril/1644346650000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2022-02-08 18:57:59--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/1b13231tbg8cgsg1pnql77plja3l2ril/1644346650000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 142.250.141.132, 2607:f8b0:4023:c0b::84
Connecting to doc-14-04-docs.googleusercontent.com (doc-14

In [2]:
!ls

sample_data  spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [3]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" 

In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

| | v1 | v2 |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


Your task is to split the data into train/dev/test sets (don't forget to shuffle the data). Make sure that each row appears in only one of the splits.
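
One way to make the "each row appears in only one split" requirement easy to check is to shuffle and cut a list of row indices rather than the rows themselves. The sketch below is only an illustration: `split_indices` is a hypothetical helper that is not part of the assignment template, and the 0.15 fractions are assumed from the split sizes used in this homework.

```python
import random

def split_indices(n_rows, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle row indices and cut them into disjoint train/val/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    val_size = int(n_rows * val_frac)
    test_size = int(n_rows * test_frac)
    train_size = n_rows - val_size - test_size
    train_idx = indices[:train_size]
    val_idx = indices[train_size:train_size + val_size]
    test_idx = indices[train_size + val_size:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(100)
# Every index lands in exactly one split.
assert set(train_idx).isdisjoint(val_idx)
assert set(train_idx).isdisjoint(test_idx)
assert set(val_idx).isdisjoint(test_idx)
```

Disjoint indices do not catch messages that occur more than once in the data, so duplicates still need to be dropped before splitting.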

In [5]:
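# Target split: 0.15 for val, 0.15 for test, 0.7 for train.
# Note: the sizes below are computed from the row count before duplicates are dropped,
# so after deduplication the val and test shares end up slightly above 0.15.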
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

"""
YOUR CODE GOES HERE
"""
print('')

# shuffle the data 
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(f'Before removing duplicates, there are {len(df)} messages')
df = df.drop_duplicates(['v2', 'v1'])
print(f'After removing duplicates, there are {len(df)} messages')

# split the data to train/dev/test 
train_texts, train_labels = df["v2"][:len(df)-val_size-test_size].reset_index(drop=True), df["v1"][:len(df)-val_size-test_size].reset_index(drop=True)
val_texts, val_labels     = df["v2"][len(df)-val_size-test_size:len(df)-test_size].reset_index(drop=True), df["v1"][len(df)-val_size-test_size:len(df)-test_size].reset_index(drop=True)
test_texts, test_labels   = df["v2"][len(df)-test_size:].reset_index(drop=True), df["v1"][len(df)-test_size:].reset_index(drop=True)

'\nYOUR CODE GOES HERE\n'


Before removing duplicates, there are 5572 messages
After removing duplicates, there are 5169 messages


## Data Processing (40 points)

The task is to create bag-of-words features: tokenize the text, index each token, represent each message as a dictionary of tokens and their counts, and limit the vocabulary to the $n$ most frequent tokens. In the lab we used the built-in `sklearn` class `sklearn.feature_extraction.text.CountVectorizer`.
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

The function `preprocess_data` takes a list of texts and returns a list of token lists (one list of tokens per text).
You may use the [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in the `preprocess_data` function.

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.
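
To make the target representation concrete, here is a minimal sketch of bag-of-words counting on a made-up, already tokenized corpus. It is not the required `Vectorizer` implementation; the corpus, the value `n = 2`, and the variable names are assumptions chosen for illustration.

```python
from collections import Counter

# Toy tokenized corpus (made up for illustration).
docs = [
    ["free", "prize", "call"],
    ["call", "me"],
    ["free", "free", "now"],
]

# Keep the n most frequent tokens across the whole corpus.
n = 2
token_counts = Counter(token for doc in docs for token in doc)
vocab = [token for token, _ in token_counts.most_common(n)]
token_to_index = {token: i for i, token in enumerate(vocab)}

# One row per document, one column per vocabulary token.
matrix = []
for doc in docs:
    row = [0] * len(vocab)
    for token, count in Counter(doc).items():
        if token in token_to_index:  # out-of-vocabulary tokens are ignored
            row[token_to_index[token]] = count
    matrix.append(row)

print(vocab)
print(matrix)
```

Here "free" occurs three times and "call" twice, so they form the vocabulary, and the rows come out as `[1, 1]`, `[0, 1]`, and `[2, 0]`.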


In [6]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

from collections import Counter

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [7]:
def preprocess_data(data):
    # This function should return a list of lists of preprocessed tokens for each message
    """
    YOUR CODE GOES HERE
    """
    token_lst = []
    for txt in data:
      token_lst.append(word_tokenize(txt))
    preprocessed_data = token_lst
    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

In [8]:
train_data

[['Aight',
  'should',
  'I',
  'just',
  'plan',
  'to',
  'come',
  'up',
  'later',
  'tonight',
  '?'],
 ['Was', 'the', 'farm', 'open', '?'],
 ['I',
  'sent',
  'my',
  'scores',
  'to',
  'sophas',
  'and',
  'i',
  'had',
  'to',
  'do',
  'secondary',
  'application',
  'for',
  'a',
  'few',
  'schools',
  '.',
  'I',
  'think',
  'if',
  'you',
  'are',
  'thinking',
  'of',
  'applying',
  ',',
  'do',
  'a',
  'research',
  'on',
  'cost',
  'also',
  '.',
  'Contact',
  'joke',
  'ogunrinde',
  ',',
  'her',
  'school',
  'is',
  'one',
  'me',
  'the',
  'less',
  'expensive',
  'ones'],
 ['Was',
  'gr8',
  'to',
  'see',
  'that',
  'message',
  '.',
  'So',
  'when',
  'r',
  'u',
  'leaving',
  '?',
  'Congrats',
  'dear',
  '.',
  'What',
  'school',
  'and',
  'wat',
  'r',
  'ur',
  'plans',
  '.'],
 ['In',
  'that',
  'case',
  'I',
  'guess',
  'I',
  "'ll",
  'see',
  'you',
  'at',
  'campus',
  'lodge'],
 ['Nothing',
  'will',
  'ever',
  'be',
  'easy',
  '.',


In [9]:
import numpy as np

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will map each token in self.vocab_list
        # to its corresponding index in self.vocab_list
        """
        YOUR CODE GOES HERE
        """
        # Count every token across all tokenized messages
        token_counts = Counter(token for tokens in dataset for token in tokens)
        # Use the instance attribute rather than the global max_features
        freq_lst = token_counts.most_common(self.max_features)
        self.vocab_list = [token for token, freq in freq_lst]
        # A dictionary lookup avoids a linear list.index() scan for every token
        self.token_to_index = {token: idx for idx, token in enumerate(self.vocab_list)}

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        """
        YOUR CODE GOES HERE
        """
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        for txt_idx, tokens in enumerate(dataset):
            # Count the tokens of this message and fill the matching columns
            for token, freq in Counter(tokens).items():
                idx = self.token_to_index.get(token)
                if idx is not None:  # skip out-of-vocabulary tokens
                    data_matrix[txt_idx, idx] = freq
        print(data_matrix)
        return data_matrix

In [10]:
max_features = 2000  # vocabulary size: number of most frequent tokens to keep
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list

[[0. 1. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [2. 2. 2. ... 0. 0. 0.]
 ...
 [0. 1. 1. ... 0. 0. 0.]
 [2. 1. 0. ... 0. 0. 0.]
 [0. 0. 3. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [1. 0. 1. ... 0. 0. 0.]
 ...
 [2. 2. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [2. 1. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 2. ... 0. 0. 0.]
 ...
 [0. 0. 2. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [3. 2. 1. ... 0. 0. 0.]]


(10 extra points) You can add more features to the feature matrix.
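
One possible direction, sketched below under stated assumptions, is to append a few hand-crafted per-message statistics to the count features. The helper `extra_features` and the three chosen statistics (message length, digit count, share of uppercase letters) are hypothetical examples, not part of the assignment or a known-best feature set.

```python
def extra_features(texts):
    """Return one [length, digit_count, uppercase_ratio] row per message."""
    rows = []
    for text in texts:
        n_chars = len(text)
        n_digits = sum(ch.isdigit() for ch in text)
        n_letters = sum(ch.isalpha() for ch in text)
        n_upper = sum(ch.isupper() for ch in text)
        # Avoid division by zero for messages without letters.
        upper_ratio = n_upper / n_letters if n_letters else 0.0
        rows.append([n_chars, n_digits, upper_ratio])
    return rows

example = extra_features(["WIN a prize 08001234", "ok see you"])
print(example)
```

In this notebook the result could be appended column-wise, for example with `np.hstack([X_train, np.array(extra_features(train_texts))])`, and likewise for the val and test matrices. Message length lives on a different scale than token counts, so rescaling it may help.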

In [11]:
"""
YOUR CODE GOES HERE
"""

'\nYOUR CODE GOES HERE\n'

## Model

We train a logistic regression model and save its predictions for the train, val, and test sets.


In [12]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

LogisticRegression(random_state=0, solver='liblinear')

## Performance of the model (30 points)

Your task is to report train, val, test accuracies and F1 scores. **You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.** 

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.
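
As a reminder of the definitions, both metrics follow from the four confusion counts. The sketch below works through them on made-up labels; `confusion_counts` is a hypothetical helper and not the required implementation.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up labels: 3 spam messages and 5 ham messages.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of correct predictions
precision = tp / (tp + fp)           # share of predicted spam that is spam
recall = tp / (tp + fn)              # share of actual spam that was found
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

With these labels the counts are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, so accuracy is 0.75 while precision, recall, and F1 are all 2/3.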

In [13]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    accuracy = np.sum(y_true==y_pred)/len(y_true)
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    # True positives: messages with label 1 (spam) that were also predicted as 1
    tp = np.sum((y_true == 1) & (y_pred == 1))
    # With no true positives, precision or recall is 0 (or undefined), so report F1 as 0
    if tp == 0:
        return 0.0
    precision = tp / np.sum(y_pred == 1)
    recall = tp / np.sum(y_true == 1)
    f1 = 2 * precision * recall / (precision + recall)
    return f1

In [14]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.995, F1 score: 0.981
Validation accuracy: 0.978, F1 score: 0.889
Test accuracy: 0.982, F1 score: 0.921


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If no, which metric is optimized in logistic regression?

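**Your answer:** No. Logistic regression is trained by minimizing the negative log-likelihood (cross-entropy) loss, $L(x, c) = -\log P(y = c \mid x)$, where $c$ is the true label; the parameters are updated iteratively to reduce this loss. Accuracy is piecewise constant in the parameters, so it cannot be optimized directly with gradient-based methods.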
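
The difference between the training loss and accuracy can be seen in a small sketch. The probabilities below are made up, and `negative_log_likelihood` is a hypothetical helper that averages $-\log P(y = c \mid x)$ over examples; it is not how `sklearn` computes its objective internally.

```python
import math

def negative_log_likelihood(probs_spam, labels):
    """Mean of -log P(y = c | x), where probs_spam[i] is the predicted P(y = 1 | x_i)."""
    total = 0.0
    for p, c in zip(probs_spam, labels):
        # Probability the model assigns to the true class c.
        p_correct = p if c == 1 else 1.0 - p
        total += -math.log(p_correct)
    return total / len(labels)

labels = [1, 0]
confident = negative_log_likelihood([0.9, 0.1], labels)
hesitant = negative_log_likelihood([0.6, 0.4], labels)
print(confident, hesitant)
```

Both sets of probabilities give 100% accuracy at a 0.5 threshold, yet the loss is lower for the confident predictions, so the two quantities are not interchangeable.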

**Question.**
In general, does having 0.99 accuracy on the test set mean that the model is great? If not, can you give an example of a case where the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** No. On imbalanced data, high accuracy can still come from a bad classifier. For example, in fraud detection the ratio of fraudulent (positive) cases to legitimate (negative) cases can be 1 to 1000. A classifier that simply predicts every case as negative reaches about 99.9% accuracy, yet it never detects a single fraud, which defeats its purpose. Its recall on the positive class is 0, and so is its F1 score. F1 is therefore more informative here, because it takes both precision and recall on the positive class into account.
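
A concrete version of the imbalance argument, using made-up data in which the rare class is labeled 1 and the classifier never predicts it:

```python
# Made-up data: 1 fraud case (label 1) among 1000 transactions.
y_true = [1] + [0] * 999
# A degenerate classifier that never flags fraud.
y_pred = [0] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = sum(1 for t, p in pairs if t == p) / len(y_true)
recall = tp / (tp + fn)
# Equivalent form of F1 that avoids the 0/0 precision when nothing is predicted positive.
f1 = 2 * tp / (2 * tp + fp + fn)
print(accuracy, recall, f1)
```

Accuracy comes out at 0.999 while recall and F1 are both 0, which is the gap that the hint about F1 points to.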

### Exploration of predictions (20 points)

Show a few examples with true+predicted labels on the train and val sets.

In [15]:
"""
YOUR CODE GOES HERE
"""
# 1 - spam, 0 - ham
df_train = pd.DataFrame(list(zip(train_texts, y_train, y_train_pred)), columns=['train_set_text', 'true_label', 'predicted_label'])
df_val = pd.DataFrame(list(zip(val_texts, y_val, y_val_pred)), columns=['val_set_text', 'true_label', 'predicted_label'])

def show_examples(df, description, n=5):
    # Print up to n rows of df as text / true label / predicted label
    print(" ")
    print(f"*** {description} ***")
    for text, true_label, predicted_label in df.head(n).itertuples(index=False):
        print(f"[text]: {text}")
        print(f"[true label]: {true_label}")
        print(f"[predicted label]: {predicted_label}")
        print(" ")

df_train_s = df_train[df_train["true_label"]==df_train["predicted_label"]]
show_examples(df_train_s, "Examples of train sets whose true labels are the same as predicted labels")

df_train_ns = df_train[df_train["true_label"]!=df_train["predicted_label"]]
show_examples(df_train_ns, "Examples of train sets whose true labels are not the same as predicted labels")

df_val_s = df_val[df_val["true_label"]==df_val["predicted_label"]]
show_examples(df_val_s, "Examples of val sets whose true labels are the same as predicted labels")

df_val_ns = df_val[df_val["true_label"]!=df_val["predicted_label"]]
show_examples(df_val_ns, "Examples of val sets whose true labels are not the same as predicted labels")

'\nYOUR CODE GOES HERE\n'

 
*** Examples of train sets whose true labels are the same as predicted labels ***
[text]: Aight should I just plan to come up later tonight?
[true label]: 0
[predicted label]: 0
 
[text]: Was the farm open?
[true label]: 0
[predicted label]: 0
 
[text]: I sent my scores to sophas and i had to do secondary application for a few schools. I think if you are thinking of applying, do a research on cost also. Contact joke ogunrinde, her school is one me the less expensive ones
[true label]: 0
[predicted label]: 0
 
[text]: Was gr8 to see that message. So when r u leaving? Congrats dear. What school and wat r ur plans.
[true label]: 0
[predicted label]: 0
 
[text]: In that case I guess I'll see you at campus lodge
[true label]: 0
[predicted label]: 0
 
 
*** Examples of train sets whose true labels are not the same as predicted labels ***
[text]: http//tms. widelive.com/index. wml?id=820554ad0a1705572711&first=trueåÁC C RingtoneåÁ
[true label]: 1
[predicted label]: 0
 
[text]: ringtoneking 

**Question.** Print 10 examples from the val set that were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** 
- The 3 false positives were probably misclassified because they contain tokens that are frequent in spam, such as "THATåÕS AL!!!!!!!!!", "sms NO to &lt;#&gt", and "miss call".
- The 7 false negatives probably stem from the bag-of-words representation, which keeps only token counts and therefore loses the context of each word. Words such as "bill" or "Account Statement" are common in ham on their own, yet phrases such as "allow companies to bill for SMS" or "Account Statement for 078" are strong indicators of spam.
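
One way to probe explanations like these is to look at how much each token of a misclassified message contributes to the linear score (count times weight). The sketch below uses a made-up vocabulary and made-up weights; `token_contributions` is a hypothetical helper, not part of the assignment.

```python
from collections import Counter

def token_contributions(tokens, vocab, weights):
    """Return (token, count * weight) pairs for in-vocabulary tokens, largest magnitude first."""
    weight_of = dict(zip(vocab, weights))
    counts = Counter(tokens)
    contributions = [
        (token, count * weight_of[token])
        for token, count in counts.items()
        if token in weight_of  # out-of-vocabulary tokens contribute nothing
    ]
    return sorted(contributions, key=lambda pair: abs(pair[1]), reverse=True)

# Made-up vocabulary and weights; positive weights push towards spam.
vocab = ["free", "call", "statement", "your"]
weights = [1.8, 0.9, -0.2, -0.1]
ranked = token_contributions(["your", "account", "statement", "call", "call"], vocab, weights)
print(ranked)
```

In this notebook, the fitted weights are available as `model.coef_[0]`, so `vocab` and that array could be passed in together with the tokens of a misclassified message, provided no extra feature columns were appended.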

In [16]:
"""
YOUR CODE GOES HERE
"""
df_wrong_pre_val = df_val[df_val.true_label-df_val.predicted_label==1] # false negative
df_wrong_pre_val_ = df_val[df_val.true_label-df_val.predicted_label==-1] # false positive
df_wrong_pre_val.head(5)
df_wrong_pre_val_.head(5)

'\nYOUR CODE GOES HERE\n'

| | val_set_text | true_label | predicted_label |
|---|---|---|---|
| 1 | Check Out Choose Your Babe Videos @ sms.shsex.... | 1 | 0 |
| 61 | 0A$NETWORKS allow companies to bill for SMS, s... | 1 | 0 |
| 150 | PRIVATE! Your 2003 Account Statement for 078 | 1 | 0 |
| 169 | ROMCAPspam Everyone around should be respondin... | 1 | 0 |
| 315 | Guess who am I?This is the first time I create... | 1 | 0 |


| | val_set_text | true_label | predicted_label |
|---|---|---|---|
| 9 | Miss call miss call khelate kintu opponenter m... | 0 | 1 |
| 174 | Y?WHERE U AT DOGBREATH? ITS JUST SOUNDING LIKE... | 0 | 1 |
| 651 | I (Career Tel) have added u as a contact on IN... | 0 | 1 |


In [18]:
df_txt = df_wrong_pre_val.val_set_text.tolist()
df_true_label = df_wrong_pre_val.true_label.tolist()
df_pred_label = df_wrong_pre_val.predicted_label.tolist()
print(" ")
print("*** 7 false negative examples from val sets ***")
for i in range(7):
  print(f"[text]: {df_txt[i]}")
  print(f"[true label]: {df_true_label[i]}")
  print(f"[predicted label]: {df_pred_label[i]}")
  print(" ")

df_txt = df_wrong_pre_val_.val_set_text.tolist()
df_true_label = df_wrong_pre_val_.true_label.tolist()
df_pred_label = df_wrong_pre_val_.predicted_label.tolist()
print(" ")
print("*** 3 false positive examples from val sets ***")
for i in range(3):
  print(f"[text]: {df_txt[i]}")
  print(f"[true label]: {df_true_label[i]}")
  print(f"[predicted label]: {df_pred_label[i]}")
  print(" ")

 
*** 7 false negative examples from val sets ***
[text]: Check Out Choose Your Babe Videos @ sms.shsex.netUN fgkslpoPW fgkslpo
[true label]: 1
[predicted label]: 0
 
[text]: 0A$NETWORKS allow companies to bill for SMS, so they are responsible for their \suppliers\"
[true label]: 1
[predicted label]: 0
 
[text]: PRIVATE! Your 2003 Account Statement for 078
[true label]: 1
[predicted label]: 0
 
[text]: ROMCAPspam Everyone around should be responding well to your presence since you are so warm and outgoing. You are bringing in a real breath of sunshine.
[true label]: 1
[predicted label]: 0
 
[text]: Guess who am I?This is the first time I created a web page WWW.ASJESUS.COM read all I wrote. I'm waiting for your opinions. I want to be your friend 1/1
[true label]: 1
[predicted label]: 0
 
[text]: The current leading bid is 151. To pause this auction send OUT. Customer Care: 08718726270
[true label]: 1
[predicted label]: 0
 
[text]: SMS. ac Sptv: The New Jersey Devils and the Detroit Red 