The deadline is 9:30am Feb 9th (Wed).   
You should submit a `.ipynb` file with your solutions to BrightSpace.

--- 

There are 10 extra points for "adding extra features to your model". But the maximum grade you can obtain in this homework is 100%. If you complete the extra-credit task, your score will be min{10+score, 100}.

---


In this homework we will preprocess SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection. 

## Data Loading (10 points)

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and loaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [4]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

/bin/sh: wget: command not found


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [3]:
shuffle = df.sample(frac=1) #use this to shuffle the dataset
shuffle

Unnamed: 0,v1,v2
3700,0,Shall i get my pouch?
2329,0,Am surfing online store. For offers do you wan...
2164,0,"Nothing really, just making sure everybody's u..."
5357,0,Ok
3096,0,"Yo, you at jp and hungry like a mofo?"
...,...,...
775,0,Thanks for picking up the trash.
3725,0,No chikku nt yet.. Ya i'm free
4892,0,Send me the new number
3267,0,Which is why i never wanted to tell you any of...


Your task is to split the data to train/dev/test (don't forget to shuffle the data). Make sure that each row appears only in one of the splits.

In [7]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)


# splitting the data
val = shuffle[:val_size]
test = shuffle[val_size:val_size + test_size]
train = shuffle[val_size + test_size:]


train_texts, train_labels = train[:]["v2"], train[:]["v1"]
val_texts, val_labels     = val[:]["v2"], val[:]["v1"]
test_texts, test_labels   = test[:]["v2"], test[:]["v1"]

## Data Processing (40 points)

The task is to create bag-of-words features: tokenize the text, index each token, represent the sentence as a dictionary of tokens and their counts, limit the vocabulary to $n$ most frequent tokens. In the lab we use built-in `sklearn` function, `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

Function `preprocess_data` takes the list of texts and returns list of (lists of tokens). 
You may use [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in `preprocess_data` function. 

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.


In [9]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/seonhyeyang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [67]:
def preprocess_data(data):
    # This function should return a list of lists of preprocessed tokens for each message
    
    nltk.download("punkt")
    result = []
    for word in data:
        result.append(nltk.word_tokenize(word))
    
    preprocessed_data = result
    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/seonhyeyang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/seonhyeyang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/seonhyeyang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
import numpy as np

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        # Create a token indexer, self.token_to_index, that will map each token in self.vocab 
        # to its corresponding index in self.vocab_list
        d = []
        for s in dataset:
            for t in s:
                d.append(t)
        
        
        w, c = np.unique(d, return_counts=True)
        
        self.vocab_list = []
        for i, w in sorted(zip(c, w), reverse=True):
            self.vocab_list.append(w)
        self.token_to_index = {}
        for i, w in enumerate(self.vocab_list):
            self.token_to_index[w] = i
        
        
    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        
        for i, d in enumerate(dataset):
            for t in d:
                if t in self.token_to_index:
                    data_matrix[i, self.token_to_index[t]] = 1
        
        
        return data_matrix

In [60]:
max_features = 750 # TODO: Replace None with a number
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list

(10 extra points) You can add more features to the feature matrix.

In [None]:
"""
YOUR CODE GOES HERE
"""

## Model

We train logistic regression model and save prediction for train, val and test.


In [14]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  indices = (scores > 0).astype(np.int)
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  indices = (scores > 0).astype(np.int)
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  indices = (scores > 0).astype(np.int)


## Performance of the model (30 points)

Your task is to report train, val, test accuracies and F1 scores. **You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.** 

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.

In [68]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    accuracy = (y_true == y_pred).sum()  / len(y_true) #correct predictions/#total predictions
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction
    t = ((y_true == 1) & (y_pred == 1)).sum()
    p = (y_pred == 1).sum()
    ps = (y_true == 1).sum()
    pres = t / p
    if ps <= 0:
        pres = 0
    r = t / ps
    if ps <= 0:
        r = 0
        
    f1 = 2 / ((1 / pres) + (1/r))
    if pres <= 0 or r <=0:
        f1 = 0
    return f1

In [69]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.998, F1 score: 0.994
Validation accuracy: 0.978, F1 score: 0.912
Test accuracy: 0.989, F1 score: 0.954


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If no, which metric is optimized in logistic regression?

**Your answer:** logistic regression is a statistical model that uses logistic function to model a binary output such as pass/fail. Logistic regression minimizes the negative log of the likelihood of the correct output and maximizes the liklihood of the correct output.

**Question.**
In general, does having 0.99 accuracy on test means that the model is great? If no, can you give an example of a case when the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** Just because a model has 0.99 accuracy on a test that does not mean the model is ideal or great. For example, if we had a dataset of 1000 students based on sex (male or female) and the ratio was 990:10 male to female respectively. Then we would have a high accuracy on males but a low accuracy for females. It's important that our data has balance. 

### Exploration of predicitons (20 points)

Show a few examples with true+predicted labels on the train and val sets.

In [64]:
# 1 - spam, 0 - ham
t = 1
f = 0
tcs = train_texts.iloc[np.where((y_train == y_train_pred) & (y_train == t))]
tch = train_texts.iloc[np.where((y_train == y_train_pred) & (y_train == f))]
vcs = val_texts.iloc[np.where((y_val == y_val_pred) & (y_val == t))]
vch = val_texts.iloc[np.where((y_val == y_val_pred) & (y_val == f))]

In [65]:
#spam
for i in range(10):
    print(tcs.iloc[i])
    print(vcs.iloc[i])

07732584351 - Rodger Burns - MSG = We tried to call you re your reply to our sms for a free nokia mobile + free camcorder. Please call now 08000930705 for delivery tomorrow
Congrats! 1 year special cinema pass for 2 is yours. call 09061209465 now! C Suprman V, Matrix3, StarWars3, etc all 4 FREE! bx420-ip4-5we. 150pm. Dont miss out! 
Kit Strip - you have been billed 150p. Netcollex Ltd. PO Box 1013 IG11 OJA
Congratulations YOU'VE Won. You're a Winner in our August å£1000 Prize Draw. Call 09066660100 NOW. Prize Code 2309.
Do you want a New Nokia 3510i Colour Phone Delivered Tomorrow? With 200 FREE minutes to any mobile + 100 FREE text + FREE camcorder Reply or Call 8000930705
Todays Voda numbers ending 7548 are selected to receive a $350 award. If you have a match please call 08712300220 quoting claim code 4041 standard rates app
PRIVATE! Your 2003 Account Statement for shows 800 un-redeemed S. I. M. points. Call 08718738002 Identifier Code: 48922 Expires 21/11/04
No. 1 Nokia Tone 4 ur m

In [66]:
#ham
for i in range(10):
    print(tch.iloc[i])
    print(vch.iloc[i])

Its  &lt;#&gt; k here oh. Should i send home for sale.
Nice line said by a broken heart- Plz don't cum 1 more times infront of me... Other wise once again I ll trust U... Good 9t:)
Awesome, plan to get here any time after like  &lt;#&gt; , I'll text you details in a wee bit
Did u find out what time the bus is at coz i need to sort some stuff out.
This pay is  &lt;DECIMAL&gt;  lakhs:)
Hi i won't b ard 4 christmas. But do enjoy n merry x'mas.
I think its far more than that but find out. Check google maps for a place from your dorm.
Come to mahal bus stop.. &lt;DECIMAL&gt;
Good. do you think you could send me some pix? I would love to see your top and bottom...
Your pussy is perfect!
Hi.:)technical support.providing assistance to us customer through call and email:)
My supervisor find 4 me one lor i thk his students. I havent ask her yet. Tell u aft i ask her.
Well am officially in a philosophical hole, so if u wanna call am at home ready to be saved!
Wat r u doing now?
Ultimately tor mot

**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** 

In [63]:
val = np.array(val_texts)

x = 0
num = 10
for i in range(num):
  if y_val[x] == y_val_pred[x]:
    x += 1
  print("labeled incorrectly by the model:", val[x])

labeled incorrectly by the model: Nice line said by a broken heart- Plz don't cum 1 more times infront of me... Other wise once again I ll trust U... Good 9t:)
labeled incorrectly by the model: Did u find out what time the bus is at coz i need to sort some stuff out.
labeled incorrectly by the model: Hi i won't b ard 4 christmas. But do enjoy n merry x'mas.
labeled incorrectly by the model: Come to mahal bus stop.. &lt;DECIMAL&gt;
labeled incorrectly by the model: Your pussy is perfect!
labeled incorrectly by the model: My supervisor find 4 me one lor i thk his students. I havent ask her yet. Tell u aft i ask her.
labeled incorrectly by the model: Wat r u doing now?
labeled incorrectly by the model: She is our sister.. She belongs 2 our family.. She is d hope of tomorrow.. Pray 4 her,who was fated 4 d Shoranur train incident. Lets hold our hands together &amp; fuelled by love &amp; concern prior 2 her grief &amp; pain. Pls join in dis chain &amp; pass it. STOP VIOLENCE AGAINST WOMEN.
l