# Part 1.

The deadline for Part 1 is **1:30 pm Feb 6th, 2020**.   
You should submit a `.ipynb` file with your solutions to NYU Classes.

---


In this part we will preprocess SMS Spam Collection Dataset and train a bag-of-words classifier (logistic regression) for spam detection. 

## Data Loading

First, we download the SMS Spam Collection Dataset. The dataset is taken from [Kaggle](https://www.kaggle.com/uciml/sms-spam-collection-dataset/data#) and loaded to [Google Drive](https://drive.google.com/open?id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR) so that everyone can access it.

In [2]:
!wget 'https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR' -O spam.csv 

--2020-04-12 14:08:15--  https://docs.google.com/uc?export=download&id=1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR
Resolving docs.google.com (docs.google.com)... 172.217.8.206
Connecting to docs.google.com (docs.google.com)|172.217.8.206|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/nsgf8983eq0kc4k10jk93njtt62pto4b/1586714850000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download [following]
--2020-04-12 14:08:16--  https://doc-14-04-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/nsgf8983eq0kc4k10jk93njtt62pto4b/1586714850000/08752484438609855375/*/1OVRo37agn02mc6yp5p6-wtJ8Hyb-YMXR?e=download
Resolving doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)... 216.58.192.193
Connecting to doc-14-04-docs.googleusercontent.com (doc-14-04-docs.googleusercontent.com)|216.58.192.193|:443... connected.
HT

In [3]:
!ls

Copy_of_HW_1_Part_1_Spam_Prediction.ipynb
Prelim_analysis-Copy1.ipynb
Preprocessing.ipynb
buffer.txt
spam.csv


There are two columns: `v1` -- spam or ham indicator, `v2` -- text of the message.

In [4]:
import pandas as pd
import numpy as np

df = pd.read_csv("spam.csv", usecols=["v1", "v2"], encoding='latin-1')
# 1 - spam, 0 - ham
df.v1 = (df.v1 == "spam").astype("int")
df.head()

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Your task is to split the data to train/dev/test. Make sure that each row appears only in one of the splits.

In [5]:
from sklearn import model_selection

In [6]:
# 0.15 for val, 0.15 for test, 0.7 for train
val_size = int(df.shape[0] * 0.15)
test_size = int(df.shape[0] * 0.15)
train_size = int(df.shape[0] * 0.70)

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    df['v2'], df['v1'], test_size=0.15, random_state=1)

X_train, X_val, y_train, y_val = model_selection.train_test_split(
    X_train, y_train, test_size=(15/85), random_state=1)

train_texts, train_labels = X_train, y_train
val_texts, val_labels     = X_val, y_val
test_texts, test_labels   = X_test, y_test

#no need to check for correct split because train_test_split automatically does that. Duplicates in the data itself is
#a different story but is not asked to be checked.

In [29]:
train_texts

1193    Sex up ur mobile with a FREE sexy pic of Jorda...
5507               I want to be inside you every night...
801     Appt is at &lt;TIME&gt; am. Not my fault u don...
364     Good afternoon sunshine! How dawns that day ? ...
1775            Did u see what I posted on your Facebook?
2335    Which is weird because I know I had it at one ...
4207    Or i go home first lar Ì_ wait 4 me lor.. I pu...
2495    WINNER! As a valued network customer you hvae ...
4159    i felt so...not any conveying reason.. Ese he....
1098    NO GIFTS!! You trying to get me to throw mysel...
3626    Still chance there. If you search hard you wil...
4200    Wylie update: my weed dealer carlos went to fr...
138     You'll not rcv any more msgs from the chat svc...
1899                          I love working from home :)
3825    Goodmorning,my grandfather expired..so am on l...
2401    Babe: U want me dont u baby! Im nasty and have...
4483         Shopping? Eh ger i toking abt syd leh...Haha
3829    I agre

## Data Processing

The task is to create bag-of-words features: tokenize the text, index each token, represent the sentence as a dictionary of tokens and their counts, limit the vocabulary to $n$ most frequent tokens. In the lab we use built-in `sklearn` function, `sklearn.feature_extraction.text.CountVectorizer`. 
**In this HW, you are required to implement the `Vectorizer` on your own without using `sklearn` built-in functions.**

Function `preprocess_data` takes the list of texts and returns list of (lists of tokens). 
You may use [spacy](https://spacy.io/) or [nltk](https://www.nltk.org/) text processing libraries in `preprocess_data` function. 

Class `Vectorizer` is used to vectorize the text and to create a matrix of features.

In [10]:
import spacy
import en_core_web_sm

In [24]:
from tqdm import tqdm

def preprocess_data(data):
    
    preprocessed_data = []
    nlp = en_core_web_sm.load()
    
    #new to spacy, decided to not change any settings
    for text in tqdm(data):
        doc = nlp(text)
        new_list = []
        for token in doc:
            new_list.append(token.text)
        preprocessed_data.append(new_list)

    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

100%|██████████| 3900/3900 [01:14<00:00, 52.26it/s]
100%|██████████| 836/836 [00:17<00:00, 47.99it/s]
100%|██████████| 836/836 [00:19<00:00, 42.61it/s]


In [25]:
def preprocess_data(data):
    
    preprocessed_data = []
    nlp = en_core_web_sm.load()
    
    #new to spacy, decided to not change any settings
    for text in tqdm(data):
        doc = nlp(text)
        new_list = [token.text for token in doc]
        preprocessed_data.append(new_list)

    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

100%|██████████| 3900/3900 [01:16<00:00, 51.15it/s]
100%|██████████| 836/836 [00:18<00:00, 55.90it/s]
100%|██████████| 836/836 [00:14<00:00, 57.93it/s]


In [33]:
def preprocess_data(data):
    
    nlp = en_core_web_sm.load()
    
    #new to spacy, decided to not change any settings
    preprocessed_data = [[token.text for token in nlp(text)] for text in tqdm(data)]

    return preprocessed_data

train_data = preprocess_data(train_texts)
val_data = preprocess_data(val_texts)
test_data = preprocess_data(test_texts)

100%|██████████| 3900/3900 [01:12<00:00, 59.35it/s]
100%|██████████| 836/836 [00:16<00:00, 49.49it/s]
100%|██████████| 836/836 [00:17<00:00, 46.47it/s]


In [34]:
train_data

[['Sex',
  'up',
  'ur',
  'mobile',
  'with',
  'a',
  'FREE',
  'sexy',
  'pic',
  'of',
  'Jordan',
  '!',
  'Just',
  'text',
  'BABE',
  'to',
  '88600',
  '.',
  'Then',
  'every',
  'wk',
  'get',
  'a',
  'sexy',
  'celeb',
  '!',
  'PocketBabe.co.uk',
  '4',
  'more',
  'pics',
  '.',
  '16',
  'å£3/wk',
  '087016248'],
 ['I', 'want', 'to', 'be', 'inside', 'you', 'every', 'night', '...'],
 ['Appt',
  'is',
  'at',
  '&',
  'lt;TIME&gt',
  ';',
  'am',
  '.',
  'Not',
  'my',
  'fault',
  'u',
  'do',
  "n't",
  'listen',
  '.',
  'I',
  'told',
  'u',
  'twice'],
 ['Good',
  'afternoon',
  'sunshine',
  '!',
  'How',
  'dawns',
  'that',
  'day',
  '?',
  'Are',
  'we',
  'refreshed',
  'and',
  'happy',
  'to',
  'be',
  'alive',
  '?',
  'Do',
  'we',
  'breathe',
  'in',
  'the',
  'air',
  'and',
  'smile',
  '?',
  'I',
  'think',
  'of',
  'you',
  ',',
  'my',
  'love',
  '...',
  'As',
  'always'],
 ['Did', 'u', 'see', 'what', 'I', 'posted', 'on', 'your', 'Facebook', '

In [27]:
!pip install cython



In [28]:
%load_ext cython

In [36]:
%%cython -+
import numpy # Sometime we have a fail to import numpy compilation error if we don't import numpy
from cymem.cymem cimport Pool
from spacy.tokens.doc cimport Doc
from spacy.typedefs cimport hash_t
from spacy.structs cimport TokenC

cdef struct DocElement:
    TokenC* c
    int length

cdef int fast_loop(DocElement* docs, int n_docs, hash_t word, hash_t tag):
    cdef int n_out = 0
    for doc in docs[:n_docs]:
        for c in doc.c[:doc.length]:
            if c.lex.lower == word and c.tag == tag:
                n_out += 1
    return n_out

def main_nlp_fast(doc_list):
    cdef int i, n_out, n_docs = len(doc_list)
    cdef Pool mem = Pool()
    cdef DocElement* docs = <DocElement*>mem.alloc(n_docs, sizeof(DocElement))
    cdef Doc doc
    for i, doc in enumerate(doc_list): # Populate our database structure
        docs[i].c = doc.c
        docs[i].length = (<Doc>doc).length
    word_hash = doc.vocab.strings.add('run')
    tag_hash = doc.vocab.strings.add('NN')
    n_out = fast_loop(docs, n_docs, word_hash, tag_hash)
    print(n_out)

CompileError: command 'gcc' failed with exit status 1

In [14]:
import numpy as np

class Vectorizer():
    def __init__(self, max_features):
        self.max_features = max_features
        self.vocab_list = None
        self.token_to_index = None

    def fit(self, dataset):
        # Create a vocab list, self.vocab_list, using the most frequent "max_features" tokens
        
        vocab_master = np.array([ elem for row in dataset for elem in row])
        
        word_list,count = np.unique(vocab_master,return_counts = True)
        word_list_sorted = word_list[np.argsort(-count)]
        
        self.vocab_list = word_list_sorted[:self.max_features]

        # Create a token indexer, self.token_to_index, that will return index of the token in self.vocab_list
        self.token_to_index = {}

        for i,word in enumerate(self.vocab_list):
          self.token_to_index[word] = i
        
        pass

    def transform(self, dataset):
        # This function transforms text dataset into a matrix, data_matrix
        """
        YOUR CODE GOES HERE
        """
        #dictionary, append count of words only in that row
        data_matrix = np.zeros((len(dataset), len(self.vocab_list)))
        for i,row in enumerate(dataset):
          for word in row:
            data_matrix[i][self.token_to_index.get(word)] += 1
  
        
        return data_matrix

In [15]:
max_features = 2000 # TODO: Replace None with a number
vectorizer = Vectorizer(max_features=max_features)
vectorizer.fit(train_data)
X_train = vectorizer.transform(train_data)
X_val = vectorizer.transform(val_data)
X_test = vectorizer.transform(test_data)

y_train = np.array(train_labels)
y_val = np.array(val_labels)
y_test = np.array(test_labels)

vocab = vectorizer.vocab_list

In [16]:
X_train

array([[8., 7., 6., ..., 6., 6., 6.],
       [0., 1., 1., ..., 0., 0., 0.],
       [3., 1., 2., ..., 1., 1., 2.],
       ...,
       [5., 4., 4., ..., 4., 4., 4.],
       [2., 1., 1., ..., 1., 1., 1.],
       [2., 2., 2., ..., 2., 2., 2.]])

You can add more features to the feature matrix.

In [0]:
"""
YOUR CODE GOES HERE
"""

'\nYOUR CODE GOES HERE\n'

## Model

We train logistic regression model and save prediction for train, val and test.


In [17]:
from sklearn.linear_model import LogisticRegression

# Define Logistic Regression model
model = LogisticRegression(random_state=0, solver='liblinear')

# Fit the model to training data
model.fit(X_train, y_train)

# Make prediction using the trained model
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

In [18]:
y_train_pred

array([1, 0, 0, ..., 0, 0, 0])

## Performance of the model

Your task is to report train, val, test accuracies and F1 scores.
**You are required to implement `accuracy_score` and `f1_score` methods without using built-in python functions.**

Your model should achieve at least **0.95** test accuracy and **0.90** test F1 score.

In [19]:
def accuracy_score(y_true, y_pred): 
    # Calculate accuracy of the model's prediction
    
    sum_error = 0
    for i in range(0,len(y_pred)):
      error = (y_pred[i] - y_true[i]) ** 2
      sum_error += error
    
    accuracy = 1 - (sum_error / len(y_pred))
    
    return accuracy

def f1_score(y_true, y_pred): 
    # Calculate F1 score of the model's prediction

    tp = 0
    fp = 0
    fn = 0

    for i in range(0,len(y_pred)):
      if y_pred[i] == 1 and y_true[i] == 1:
        tp += 1
      if y_pred[i] == 1 and y_true[i] == 0:
        fp += 1
      if y_pred[i] == 0 and y_true[i] == 1:
        fn += 1
    
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)

    f1 = 2 * prec * rec / (prec + rec)
    return f1

In [20]:
print(f"Training accuracy: {accuracy_score(y_train, y_train_pred):.3f}, "
      f"F1 score: {f1_score(y_train, y_train_pred):.3f}")
print(f"Validation accuracy: {accuracy_score(y_val, y_val_pred):.3f}, "
      f"F1 score: {f1_score(y_val, y_val_pred):.3f}")
print(f"Test accuracy: {accuracy_score(y_test, y_test_pred):.3f}, "
      f"F1 score: {f1_score(y_test, y_test_pred):.3f}")

Training accuracy: 0.994, F1 score: 0.978
Validation accuracy: 0.982, F1 score: 0.934
Test accuracy: 0.977, F1 score: 0.909


**Question.**
Is accuracy the metric that logistic regression optimizes while training? If no, which metric is optimized in logistic regression?

**Your answer:** Yes, you can see that the training accuracy is very high. From my understanding of LogReg, it trains for the data points to be correctly classified which is accuracy.

**Question.**
In general, does having 0.99 accuracy on test means that the model is great? If no, can you give an example of a case when the accuracy is high but the model is not good? (Hint: why do we use F1 score?)

**Your answer:** 99% accuracy might be indicative of duplicates in train and test splits which results in overfitting. Another situation is when sets are imbalanced and there is an uneven distribution. In this case F1 Score might be a better measure to use to ensure a balance between Precision and Recall, especially since False Negatives and False Positives are crucial to the question ("spam" into mailbox and "ham" into spam box have negative effects).

### Exploration of predicitons

Show a few examples with true+predicted labels on the train and val sets.

In [0]:
for i in range(0,20):
  if y_train_pred[i] == 1 and y_train[i] == 1:
    print(train_texts[i])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
URGENT! You have won a 1 week FREE membership in our å£100,000 Prize Jackpot! Txt the word: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18
XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL


In [0]:
new_val_texts = val_texts.reset_index()['v2']

for i in range(0,20):
  if y_val_pred[i] == 1 and y_val[i] == 1:
    print(new_val_texts[i])

Sunshine Hols. To claim ur med holiday send a stamped self address envelope to Drinks on Us UK, PO Box 113, Bray, Wicklow, Eire. Quiz Starts Saturday! Unsub Stop
You have WON a guaranteed å£1000 cash or a å£2000 prize.To claim yr prize call our customer service representative on
URGENT! Your mobile number *************** WON a å£2000 Bonus Caller prize on 10/06/03! This is the 2nd attempt to reach you! Call 09066368753 ASAP! Box 97N7QP, 150ppm
Your account has been credited with 500 FREE Text Messages. To activate, just txt the word: CREDIT to No: 80488 T&Cs www.80488.biz


**Question** Print 10 examples from val set which were labeled incorrectly by the model. Why do you think the model got them wrong?

**Your answer:** 

In [0]:
for i in range(0,500):
  if y_val_pred[i] == 1 and y_val[i] == 0:
    print(new_val_texts[i], y_val_pred[i])
  if y_val_pred[i] == 0 and y_val[i] == 1:
    print(new_val_texts[i], y_val_pred[i])

Hi babe its Jordan, how r u? Im home from abroad and lonely, text me back if u wanna chat xxSP visionsms.com Text stop to stopCost 150p 08712400603 0
Hi its LUCY Hubby at meetins all day Fri & I will B alone at hotel U fancy cumin over? Pls leave msg 2day 09099726395 Lucy x Callså£1/minMobsmoreLKPOBOX177HP51FL 0
RCT' THNQ Adrian for U text. Rgds Vatian 0
In The Simpsons Movie released in July 2007 name the band that died at the start of the film? A-Green Day, B-Blue Day, C-Red Day. (Send A, B or C) 0
Win a å£1000 cash prize or a prize worth å£5000 0
Download as many ringtones as u like no restrictions, 1000s 2 choose. U can even send 2 yr buddys. Txt Sir to 80082 å£3  0
Can you tell Shola to please go to college of medicine and visit the academic department, tell the academic secretary what the current situation is and ask if she can transfer there. She should ask someone to check Sagamu for the same thing and lautech. Its vital she completes her medical education in Nigeria. Its less 

The model got them wrong because they are cases not in the training data. 1 and 2 seem personal at the beginning which is why the model got it wrong. 3 and 4 have very unique words so the model didn't see it as spam, for instance 4 has names of movies and bands. 5 and 6 were classified as not spam because of the symbols. 7 was misclassified as spam because the necessary action and countries in the sentence. 8,9,10 I dont see a direct reason as to why they are misclassified but it is probably the same reasons as the previous 7... if the words are not recognized in the training class as spam then when testing they wont be recognized.

## End of Part 1.
