## Hi, sklearn!

## Reading the spam collection

In [1]:
!head -n 40 ./data/1-sms-spam-train.txt

ham	Did he say how fantastic I am by any chance, or anything need a bigger life lift as losing the will 2 live, do you think I would be the first person 2 die from N V Q? 
ham	Black shirt n blue jeans... I thk i c ü...
ham	If e timing can, then i go w u lor...
ham	They r giving a second chance to rahul dengra.
ham	I cant pick the phone right now. Pls send a message
ham	Haha good to hear, I'm officially paid and on the market for an 8th
ham	Ffffffffff. Alright no way I can meet up with you sooner?
ham	But i'm really really broke oh. No amount is too small even  &lt;#&gt; 
ham	Only 2% students solved this CAT question in 'xam... 5+3+2= &lt;#&gt;  9+2+4= &lt;#&gt;  8+6+3= &lt;#&gt;  then 7+2+5=????? Tell me the answer if u r brilliant...1thing.i got d answr.
spam	<Forwarded from 21870000>Hi - this is your Mailbox Messaging SMS alert. You have 4 messages. You have 21 matches. Please call back on 09056242159 to retrieve your messages and matches
ham	No da:)he is stupid da..always 

In [2]:
import codecs

with codecs.open('./data/1-sms-spam-train.txt') as f:
    labels, messages = zip(*[line.split('\t') for line in f.readlines()])

#### read test dataset

In [3]:
with codecs.open('./data/1-sms-spam-test.txt') as f:
    kaggle_test_messages = f.readlines()

#### prepare solution

In [4]:
import numpy

In [5]:
import pandas
from IPython.display import FileLink

def create_solution(predictions, filename='1-sms-spam-predictions.csv'):
    result = pandas.DataFrame({'Id': numpy.arange(len(predictions)), 'Label': predictions})
    result.to_csv('data/{}'.format(filename), index=False)
    return FileLink('data/{}'.format(filename))

In [6]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
def compute_data_expressions(messages):
    features = []
    # length of each string
    features.append([len(message) for message in messages])

    # number of letters, digits and whitespace characters (whitespace is a rough proxy for the word count)
    for pattern in [str.isalpha, str.isdigit, str.isspace]:
        features.append([sum(pattern(char) for char in message) for message in messages])
        
    features = numpy.array(features).T
    return features

features = compute_data_expressions(messages)
kaggle_test_features = compute_data_expressions(kaggle_test_messages)

answers = numpy.array(labels) == 'spam' 

In [8]:
features

array([[168, 124,   2,  39],
       [ 44,  26,   0,  10],
       [ 38,  24,   0,  10],
       ..., 
       [ 31,  22,   0,   6],
       [175, 119,  21,  28],
       [ 25,  20,   0,   5]])
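
To make the four columns concrete, here is a minimal sketch that computes the same quantities (length, letters, digits, whitespace characters) for two made-up messages. The helper `describe_message` and the toy messages are illustrative assumptions, not part of the dataset. Note that the trailing newline is counted, because the messages above are read line by line and keep their line ending.

```python
import numpy

def describe_message(message):
    # same four columns as in `features`: length, letters, digits, whitespace characters
    return [
        len(message),
        sum(char.isalpha() for char in message),
        sum(char.isdigit() for char in message),
        sum(char.isspace() for char in message),
    ]

# made-up messages, each ending with a newline as lines read from a file do
toy_messages = ['Call 0800 now!\n', 'ok c u\n']
toy_features = numpy.array([describe_message(message) for message in toy_messages])
print(toy_features)
```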

In [9]:
from sklearn.neighbors import KNeighborsClassifier
# area under the ROC curve
from sklearn.metrics import roc_auc_score
knn_clf = KNeighborsClassifier(n_neighbors=1)
knn_clf.fit(features, answers)
# scored on the same data the model was fitted on, so this value is optimistic
roc_auc_score(answers, knn_clf.predict_proba(features)[:, 1])

0.997237808402064

In [10]:
create_solution(knn_clf.predict_proba(kaggle_test_features)[:, 1])

In [11]:
trainX, testX, trainY, testY = train_test_split(features, answers, random_state=42)

## kNN

In [12]:
knn_clf = KNeighborsClassifier(n_neighbors=1)
knn_clf.fit(trainX, trainY)
print('test {}'.format(roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])))
# print('train {}'.format(roc_auc_score(trainY, knn_clf.predict_proba(trainX)[:, 1])))

test 0.935098650052


## Finding the optimal number of neighbours

In [13]:
for n_neighbors in [1, 2, 4, 8, 16, 32, 64]:
    knn_clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_clf.fit(trainX, trainY)
    print('{} {}'.format(n_neighbors, roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1])))

1 0.935098650052
2 0.953595534787
4 0.967989211953
8 0.968775239414
16 0.976476866274
32 0.981092073382
64 0.974277431637


### what happens if the metric is changed?

In [14]:
knn_clf = KNeighborsClassifier(metric='canberra', n_neighbors=20)
knn_clf.fit(trainX, trainY)
# first line: test score, second line: train score
print(roc_auc_score(testY, knn_clf.predict_proba(testX)[:, 1]))
print(roc_auc_score(trainY, knn_clf.predict_proba(trainX)[:, 1]))

0.983277085497
0.989762131544
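
Why can the metric matter here? With the default Euclidean distance, the column with the largest range (message length) dominates the neighbour search, whereas the Canberra distance divides each coordinate difference by the sum of the absolute values, so every column contributes at most 1. Below is a minimal sketch of that formula in plain numpy, applied to the first two rows of `features` shown earlier. The function `canberra` is written for illustration only; skipping terms whose denominator is zero is an assumption of this sketch.

```python
import numpy

def canberra(x, y):
    # sum over coordinates of |x_i - y_i| / (|x_i| + |y_i|)
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    numerator = numpy.abs(x - y)
    denominator = numpy.abs(x) + numpy.abs(y)
    nonzero = denominator > 0  # skip 0/0 terms (assumption of this sketch)
    return numpy.sum(numerator[nonzero] / denominator[nonzero])

row_a = [168, 124, 2, 39]  # first row of `features`
row_b = [44, 26, 0, 10]    # second row of `features`

euclidean = numpy.sqrt(numpy.sum((numpy.array(row_a) - numpy.array(row_b)) ** 2))
print('euclidean: {:.3f}'.format(euclidean))
print('canberra:  {:.3f}'.format(canberra(row_a, row_b)))
```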


## Bag of words

In [15]:
vectorizer = CountVectorizer()
vectorizer.fit(messages)
counts = vectorizer.transform(messages).toarray()
test_counts = vectorizer.transform(kaggle_test_messages).toarray()

In [16]:
counts.shape

(3000, 6294)

In [17]:
# vocabulary_ is a dictionary that maps each word to the index of its column in counts
# vectorizer.vocabulary_
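
To see how the vocabulary relates to the count matrix, here is a small sketch on three made-up messages (the `toy_*` names and texts are assumptions for illustration). It relies on the default `CountVectorizer` settings, which lowercase the text and keep tokens of at least two word characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# three made-up messages, not taken from the dataset
toy_messages = ['free prize call now', 'call me now', 'free free entry']

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages).toarray()

# each word maps to one column; print the words in column order
for word, index in sorted(toy_vectorizer.vocabulary_.items(), key=lambda item: item[1]):
    print('{} {}'.format(index, word))

# one row per message, one column per word
print(toy_counts)
```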

In [18]:
trainX, testX, trainY, testY = train_test_split(counts, answers, random_state=42)

## Naive Bayes

#### gaussian

In [19]:
from sklearn.naive_bayes import GaussianNB
nb_clf = GaussianNB()
nb_clf.fit(trainX, trainY)
roc_auc_score(testY, nb_clf.predict_proba(testX)[:, 1])

0.88849948078920049

#### multinomial

In [20]:
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()
nb_clf.fit(trainX, trainY)
roc_auc_score(testY, nb_clf.predict_proba(testX)[:, 1])

0.97836621668397372

In [21]:
trainX.shape

(2250, 6294)

## Linear regression + Ridge regularization

In [22]:
from sklearn.linear_model import Ridge

In [23]:
ridge_clf = Ridge()
ridge_clf.fit(trainX, trainY)
# first line: test score, second line: train score
print(roc_auc_score(testY, ridge_clf.predict(testX)))
print(roc_auc_score(trainY, ridge_clf.predict(trainX)))

0.989976347064
1.0


**Exercise #0.** Play with the regularization parameter of `Ridge` (`alpha`) and see how it affects quality on train and test.
Check the quality of the best model by submitting to Kaggle.
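
One possible way to organise such an experiment is a simple loop over `alpha` values, sketched here on synthetic data so that it runs on its own. The random data, the `toy_*` names and the grid of `alpha` values are all assumptions; on the real problem you would use `trainX`, `testX`, `trainY`, `testY` instead.

```python
import numpy
from sklearn.linear_model import Ridge
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic count-like data: only the first 5 of 50 columns carry signal
rng = numpy.random.RandomState(42)
toy_X = rng.poisson(0.3, size=(400, 50))
noise = rng.normal(scale=0.5, size=400)
toy_y = ((toy_X[:, :5].sum(axis=1) + noise) > 1.5).astype(int)

toy_trainX, toy_testX, toy_trainY, toy_testY = train_test_split(toy_X, toy_y, random_state=42)

scores = []
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    clf = Ridge(alpha=alpha)
    clf.fit(toy_trainX, toy_trainY)
    train_auc = roc_auc_score(toy_trainY, clf.predict(toy_trainX))
    test_auc = roc_auc_score(toy_testY, clf.predict(toy_testX))
    scores.append((alpha, train_auc, test_auc))
    print('alpha={:<6} train={:.3f} test={:.3f}'.format(alpha, train_auc, test_auc))
```

Comparing the train and test columns side by side is the point of the exercise: a large gap between them suggests the model is fitting the training set too closely.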


**Exercise #1.** Let's write down the correspondence between columns and words (done below). Which words are the most popular?

In [24]:
dictionary = numpy.empty(len(vectorizer.vocabulary_), dtype='O')
for word, index in vectorizer.vocabulary_.items():
    dictionary[index] = word

In [25]:
dictionary

array([u'00', u'000', u'000pes', ..., u'zyada', u'\xe8n', u'\xfa1'], dtype=object)

**Exercise #2.** By analyzing the coefficients in `ridge_clf.coef_`, determine which words have the highest impact on the decision (that is, the largest absolute value of `coef_`).
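
As a hint on the mechanics, ranking by absolute value can be done with `numpy.argsort`. The sketch below uses a made-up five-word dictionary and made-up coefficients (both are assumptions, not results from the model); with the real model, `dictionary` and `ridge_clf.coef_` would take their place.

```python
import numpy

# made-up words and coefficients, for illustration only
toy_dictionary = numpy.array(['call', 'free', 'later', 'ok', 'prize'], dtype='O')
toy_coef = numpy.array([0.20, 0.45, -0.30, -0.05, 0.60])

# indices sorted by decreasing absolute value of the coefficient
order = numpy.argsort(-numpy.abs(toy_coef))

for index in order[:3]:
    print('{} {:+.2f}'.format(toy_dictionary[index], toy_coef[index]))
```

The sign is kept in the printout because it tells whether a word pushes the prediction towards spam or towards ham.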

**Exercise #3.** Does combining `features` and `counts` improve quality? Use `numpy.hstack` to concatenate the arrays.
Explain the result.
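
For reference, `numpy.hstack` joins arrays column-wise, so both arrays must have the same number of rows. A minimal sketch with made-up arrays (the `toy_*` names and values are assumptions):

```python
import numpy

toy_features = numpy.array([[15, 7, 4, 3],
                            [7, 4, 0, 3]])   # 2 messages, 4 hand-made features
toy_counts = numpy.array([[1, 0, 1],
                          [0, 2, 0]])        # 2 messages, 3 word counts

# rows stay aligned per message; columns are appended side by side
toy_combined = numpy.hstack([toy_features, toy_counts])
print(toy_combined.shape)
```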

**Exercise #4.** Print examples on which your classifier makes mistakes (both false positives and false negatives).

This is an important step towards understanding what can be done to improve the classifier.
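
Boolean masks are a convenient way to pick out both kinds of mistakes. The sketch below uses made-up messages, labels and scores, and a threshold of 0.5; all of these are assumptions, and a sensible threshold for a real model depends on the scale of its output.

```python
import numpy

# made-up data: True in toy_answers means spam
toy_messages = numpy.array(['WIN a prize now', 'see you at 5', 'call me back pls', 'ok'], dtype='O')
toy_answers = numpy.array([True, False, True, False])
toy_scores = numpy.array([0.9, 0.7, 0.2, 0.1])

threshold = 0.5  # assumed cut-off, not taken from the notebook
predicted_spam = toy_scores > threshold

false_positives = toy_messages[predicted_spam & ~toy_answers]  # ham flagged as spam
false_negatives = toy_messages[~predicted_spam & toy_answers]  # spam that slipped through

print('false positives: {}'.format(list(false_positives)))
print('false negatives: {}'.format(list(false_negatives)))
```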

**Exercise #5 (optional, just for fun).** Write a spam SMS that is not caught by your best model.
Something like "Send sms YES to 091231323 to activate amazing spam filter, FREE for two weeks, then 20p/day. Txt now!".

Use your knowledge of the model's structure.

**Major goal (not in the homework).** Provide the best classification model for the problem.

You can start by computing new features:
1. Count occurrences of individual symbols
2. Ignore words that contain digits, dots, etc.
3. Detect links and phone numbers in the text

Or start by changing the parameters of the classifiers.
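
As a starting point for the third feature idea, links and phone numbers can be approximated with regular expressions. The patterns below are deliberately crude assumptions (a link is anything starting with `http://`, `https://` or `www.`; a phone number is any run of five or more digits), and the helper names are hypothetical; they will need tuning against real messages.

```python
import re

# crude, assumed patterns; tune them against the real data
LINK_PATTERN = re.compile(r'(?:https?://|www\.)\S+', re.IGNORECASE)
PHONE_PATTERN = re.compile(r'\d{5,}')

def count_links(message):
    return len(LINK_PATTERN.findall(message))

def count_phones(message):
    return len(PHONE_PATTERN.findall(message))

examples = [
    'Please call back on 09056242159 to retrieve your messages',
    'Visit www.example.com now',
    'Black shirt n blue jeans',
]
for message in examples:
    print('{} links, {} phones: {}'.format(count_links(message), count_phones(message), message))
```

The resulting counts can be added as extra columns next to the existing features.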