The deadline is 9:30am Feb 9th (Wed).   
You should submit a `.ipynb` file with your solutions to BrightSpace.

--- 

There are 10 extra points for "adding extra features to your model". But the maximum grade you can obtain in this homework is 100%. If you complete the extra-credit task, your score will be min{10+score, 100}.

---


In this homework we will preprocess SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection. 

## Data Loading (10 points)

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and loaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [1]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2022-02-14 04:25:00--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 74.125.129.102, 74.125.129.139, 74.125.129.113, ...
Connecting to docs.google.com (docs.google.com)|74.125.129.102|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/5vemoj3s65ql3osc1lmmbg29d3fb8ero/1644812700000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2022-02-14 04:25:00--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/5vemoj3s65ql3osc1lmmbg29d3fb8ero/1644812700000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 209.85.147.132, 2607:f8b0:4001:c20::84
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-04-docs.goog

In [2]:
!ls

sample_data  spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Your task is to split the data to train/dev/test (don't forget to shuffle the data). Make sure that each row appears only in one of the splits.

In [4]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)

"""
YOUR CODE GOES HERE
"""
df = df.sample(frac=1, random_state=123).reset_index().drop(columns='index')

val_labels, val_texts     = df.loc[0:val_size, 'v1'].reset_index().drop(columns='index'), df.loc[0:val_size, 'v2'].reset_index().drop(columns='index')
test_labels, test_texts   = df.loc[val_size:val_size + test_size, 'v1'].reset_index().drop(columns='index'), df.loc[val_size:val_size + test_size, 'v2'].reset_index().drop(columns='index')
train_labels, train_texts = df.loc[test_size:, 'v1'].reset_index().drop(columns='index'), df.loc[test_size:, 'v2'].reset_index().drop(columns='index')

In [5]:
val_labels

Unnamed: 0,v1
0,0
1,1
2,0
3,1
4,0
...,...
831,0
832,1
833,0
834,0


## Data Processing (40 points)

The task is to create bag-of-words features: tokenize the text, index each token, represent the sentence as a dictionary of tokens and their counts, limit the vocabulary to $n$ most frequent tokens. In the lab we use built-in `sklearn` function, `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

Function `preprocess_data` takes the list of texts and returns list of (lists of tokens). 
You may use [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in `preprocess_data` function. 

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.


In [6]:
import spacy
nlp = spacy.load('en_core_web_sm')
import nltk
nltk.download('punkt')
from tqdm import tqdm

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [7]:
def preprocess_data(data, option:str):
  # This function should return a list of lists of preprocessed tokens for each message
  """
  YOUR CODE GOES HERE
  """
  preprocessed_data = []

  if option == 'all':
    for row in tqdm(data.loc[:,'v2']):
      doc = nlp(row)
      tokens = [token.text for token in doc]

      preprocessed_data.append(tokens)

  if option == 'words':
    for row in tqdm(data.loc[:,'v2']):
      doc = nlp(row)
      words = [token.text
              for token in doc
              if not token.is_stop and not token.is_punct]
      
    preprocessed_data.append(words)

  if option == 'nouns':
    for row in tqdm(data.loc[:, 'v2']):
      doc = nlp(row)
      nouns = [token.text
                for token in doc
                if (not token.is_stop and
                    not token.is_punct and
                    token.pos_ == "NOUN")]

  return preprocessed_data

train_data = preprocess_data(train_texts, option = 'all')
val_data = preprocess_data(val_texts, option = 'all')
test_data = preprocess_data(test_texts, option = 'all')

100%|██████████| 4737/4737 [01:03<00:00, 74.21it/s]
100%|██████████| 836/836 [00:09<00:00, 88.14it/s]
100%|██████████| 836/836 [00:09<00:00, 91.31it/s]


In [8]:
test_data

[['our',
  'mobile',
  'number',
  'has',
  'won',
  'å£5000',
  ',',
  'to',
  'claim',
  'calls',
  'us',
  'back',
  'or',
  'ring',
  'the',
  'claims',
  'hot',
  'line',
  'on',
  '09050005321',
  '.'],
 ['My',
  'friend',
  ',',
  'she',
  "'s",
  'studying',
  'at',
  'warwick',
  ',',
  'we',
  "'ve",
  'planned',
  'to',
  'go',
  'shopping',
  'and',
  'to',
  'concert',
  'tmw',
  ',',
  'but',
  'it',
  'may',
  'be',
  'canceled',
  ',',
  "havn't",
  'seen',
  ' ',
  'for',
  'ages',
  ',',
  'yeah',
  'we',
  'should',
  'get',
  'together',
  'sometime',
  '!'],
 ['yay',
  '!',
  'finally',
  'lol',
  '.',
  'i',
  'missed',
  'our',
  'cinema',
  'trip',
  'last',
  'week',
  ':-('],
 ['So', 'when', 'do', 'you', 'wanna', 'gym', 'harri'],
 ['To', 'day', 'class', 'is', 'there', 'are', 'no', 'class', '.'],
 ['Yeah',
  'no',
  'probs',
  '-',
  'last',
  'night',
  'is',
  'obviously',
  'catching',
  'up',
  'with',
  'you',
  '...',
  'Speak',
  'soon'],
 ['Yep', 'then'

In [9]:
import numpy as np
from collections import Counter
import re

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will map each token in self.vocab 
        # to its corresponding index in self.vocab_list
        """
        YOUR CODE GOES HERE
        """
        self.vocab_list = [col[0] for col in Counter([entry for sublist in dataset for entry in sublist]).most_common(self.max_features)]
        # self.vocab_list = word_freq(self.max_features)#.keys()
        self.token_to_index = {key: value for value, key in enumerate(self.vocab_list)}

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        """
        YOUR CODE GOES HERE
        """
        data_matrix = []
        for example in tqdm(dataset):
          data_row = np.zeros(len(self.vocab_list) + 2)

          
          for word in example:
            # Tokenization fitting
            if word in self.vocab_list:
              data_row[self.token_to_index[word]] = 1
            # Regex fitting for phone number feature and website link feature, this will serve as the Extra Credit work
            if re.search('\d{9,}', word):
              data_row[-2] = 1
            if re.search('(www)*\..+\.[a-z]{2,}', word):
              data_row[-1] = 1

          data_matrix.append(data_row)

        return data_matrix

In [10]:
max_features = 1000 # TODO: Replace None with a number

vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)

X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list

100%|██████████| 4737/4737 [00:00<00:00, 5483.83it/s]
100%|██████████| 836/836 [00:00<00:00, 5107.73it/s]
100%|██████████| 836/836 [00:00<00:00, 5151.21it/s]


(10 extra points) You can add more features to the feature matrix.

In [11]:
"""
YOUR CODE GOES HERE
"""
# Features added in the previous section, beginning line 36 of two blocks before this one

'\nYOUR CODE GOES HERE\n'

## Model

We train logistic regression model and save prediction for train, val and test.


In [12]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

  y = column_or_1d(y, warn=True)


## Performance of the model (30 points)

Your task is to report train, val, test accuracies and F1 scores. **You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.** 

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.

In [13]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    TP, TN, FP, FN = 0, 0, 0, 0
    for index in range(len(y_true)):
        if y_pred[index] == 1 and y_true[index] == 1:
            TP += 1
        if y_pred[index] == 1 and y_true[index] == 0:
            FP += 1
        if y_pred[index] == 0 and y_true[index] == 1:
            FN += 1
        if y_pred[index] == 0 and y_true[index] == 0:
            TN += 1
    """
    pos = 0
    for index in range(len(y_pred)):
        if y_pred[index] == y_true[index]:
            pos += 1
    """    

    accuracy = (TP + TN)/(TP + TN + FP + FN)
    # accuracy = pos / len(y_pred)
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    """
    YOUR CODE GOES HERE
    """
    TP, TN, FP, FN = 0, 0, 0, 0
    for index in range(len(y_true)):
        if y_pred[index] == 1 and y_true[index] == 1:
            TP += 1
        if y_pred[index] == 1 and y_true[index] == 0:
            FP += 1
        if y_pred[index] == 0 and y_true[index] == 1:
            FN += 1
        if y_pred[index] == 0 and y_true[index] == 0:
            TN += 1    
    f1 = TP / (TP + 0.5 * (FP + FN))
    return f1

In [14]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")

print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")

print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.995, F1 score: 0.980
Validation accuracy: 0.987, F1 score: 0.939
Test accuracy: 0.998, F1 score: 0.992


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If no, which metric is optimized in logistic regression?

**Your answer:**
No, the measure optimized is the objective function, usually the log odds of the outcome, y conditional to the input, x. Accuracy is used as a method of assessment, and will cause imblanced data

**Question.**
In general, does having 0.99 accuracy on test means that the model is great? If no, can you give an example of a case when the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:**
No, having 0.99 accuracy means the predicted value matches the actual value 99% of all instances tested. We then need tests for cases of correct identification while minimizing for false negatives. For example, if out of 100 patients who are screened for cancer, 50 actually have it, but only 45 are identified. This gives an accuracy of 90%, but the false negative rate is too high, for a process that is highly concerning of correct correct positive, not just total identification.

## Exploration of predicitons (20 points)

Show a few examples with true+predicted labels on the train and val sets.

In [15]:
"""
YOUR CODE GOES HERE
"""
# 1 - spam, 0 - ham
train_examples = pd.DataFrame({'y_train_pred': y_train_pred, 
                               'y_train': y_train.T[0], 
                               'train_texts': train_texts['v2'].values.tolist()})
train_examples.head(100)

Unnamed: 0,y_train_pred,y_train,train_texts
0,1,1,"our mobile number has won å£5000, to claim cal..."
1,0,0,"My friend, she's studying at warwick, we've pl..."
2,0,0,yay! finally lol. i missed our cinema trip las...
3,0,0,So when do you wanna gym harri
4,0,0,To day class is there are no class.
...,...,...,...
95,0,0,You made my day. Do have a great day too.
96,0,0,"Camera quite good, 10.1mega pixels, 3optical a..."
97,1,1,"URGENT! Your mobile was awarded a å£1,500 Bonu..."
98,1,1,YOUR CHANCE TO BE ON A REALITY FANTASY SHOW ca...


In [16]:
val_examples = pd.DataFrame({'y_val_pred': y_val_pred, 
                             'y_val': y_val.T[0], 
                             'val_texts': val_texts['v2'].values.tolist()})
val_examples.head(100)

Unnamed: 0,y_val_pred,y_val,val_texts
0,0,0,Good. No swimsuit allowed :)
1,1,1,Urgent! call 09066350750 from your landline. Y...
2,0,0,Im sorry bout last nite it wasnåÕt ur fault it...
3,1,1,+123 Congratulations - in this week's competit...
4,0,0,Wish i were with you now!
...,...,...,...
95,0,0,Sorry da:)i was thought of calling you lot of ...
96,0,0,Nothing. Can...
97,0,0,Short But Cute : \ Be a good person
98,0,0,You will go to walmart. I.ll stay.


**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** Mostly phone numbers with spaces in between that needs to be acounted for in the RegEx implementation or capitilizations of names, words that were nouns misspelled for emphasis to a given reader, and exclaments. Furthermore, by introducing websites, false positives of actual websites were increased as well.

In [17]:
"""
YOUR CODE GOES HERE
"""
wrong_label = val_examples[(val_examples['y_val_pred'] != val_examples['y_val'])]
wrong_label

Unnamed: 0,y_val_pred,y_val,val_texts
31,0,1,SMS. ac sun0819 posts HELLO:\You seem cool
163,0,1,Hey I am really horny want to chat or see me n...
216,0,1,Email AlertFrom: Jeri StewartSize: 2KBSubject:...
271,0,1,As a registered optin subscriber ur draw 4 å£1...
329,0,1,dating:i have had two of these. Only started a...
363,0,1,Hi ya babe x u 4goten bout me?' scammers getti...
374,0,1,Monthly password for wap. mobsi.com is 391784....
379,0,1,Babe: U want me dont u baby! Im nasty and have...
460,0,1,Download as many ringtones as u like no restri...
654,0,1,You will be receiving this week's Triple Echo ...
