# Sentence Similarity Measures III: Wide-Inclusive Sentence Featurization

## 0. Contents

* I. Corpora:
    * MSR Paraphrase Corpus
    * Brown
* II. Discriminativity Weighting (Brown, SpaCy lemmatization)
* III. Featurization:
    * Features:
        * Unigram Prec/Rec (Wan et al. 2006) 
        * Bleu Prec/Rec (Papineni et al. 2002)
        * Dependency Prec/Rec (Wan et al. 2006; Mollá 2003; Hovy et al. 2015)
        * F1 for Unigram, Bleu & Dependency
        * Tree Edit Distance (Zhang & Shasha Algorithm)
        * Sentence Lengths (Wan et al. 2006)
    * Featurization Function
* IV. Paraphrase Classifier:
    * Training: MSR Paraphrase Corpus
    * Classifier Types:
        * Logistic
        * SVM
* V. Evaluation

## I. Corpora

In [1]:
import numpy as np
import pandas as pd
from nltk.corpus import brown
from spacy.en import English
from collections import defaultdict

In [2]:
parser = English()

##### Load MSR

In [3]:
train_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/CORPORA/paraphrase/msr_paraphrase_train.txt"
test_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/CORPORA/paraphrase/msr_paraphrase_test.txt"

In [4]:
df_train = pd.read_csv(train_path, delimiter='\t')
df_test = pd.read_csv(test_path, delimiter='\t')
df_train.head()

Unnamed: 0,﻿Quality,#1 ID,#2 ID,#1 String,#2 String
0,1,702876,702977,"Amrozi accused his brother, whom he called the...","Referring to him as only the witness, Amrozi a..."
1,0,2108705,2108831,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...
2,1,1330381,1330521,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an..."
3,0,3344667,3344648,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ..."
4,1,1236820,1236712,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...


In [5]:
print df_train.shape
print df_test.shape

(4076, 5)
(1725, 5)


In [6]:
df_train.ix[0] # NB: the 'Quality' column label carries an invisible leading character (apparently a byte-order mark), so it is not plain 'Quality'.

﻿Quality                                                     1
#1 ID                                                   702876
#2 ID                                                   702977
#1 String    Amrozi accused his brother, whom he called the...
#2 String    Referring to him as only the witness, Amrozi a...
Name: 0, dtype: object

In [172]:
df_train.ix[0]['#1 String']

'Amrozi accused his brother, whom he called the witness, of deliberately distorting his evidence.'

##### Make Train/Test

In [173]:
dep_lemmas = lambda parsed_s: {(token.head.lemma_,token.lemma_) for token in parsed_s
                             if token.head.lemma_!=token.lemma_} # eliminate (v, ROOT, v) cases
dep_tokens = lambda parsed_s: {(token.head.orth_,token.orth_) for token in parsed_s
                             if token.head.lemma_!=token.lemma_}

In [223]:
get_root = lambda parsed_s: [token for token in parsed_s if token.lemma_==token.head.lemma_][0]

In [224]:
def parse_msr(df, indexer):
    
    X_dic, Y_dic = defaultdict(lambda: defaultdict(list)), \
                   defaultdict(lambda: defaultdict(list)) # default factories take no arguments
    
    for i in indexer:
        entry_dic = defaultdict(list)
        s1, s2 = df.ix[i]['#1 String'].decode('utf-8','ignore')[:-1], \
                 df.ix[i]['#2 String'].decode('utf-8','ignore')[:-1] 
                # drop the final character (the period), which causes problems in distinguishing identical tokens.
        
        parsed_s1, parsed_s2 = parser(unicode(s1)), parser(unicode(s2))
        entry_dic['s1'] = [token.orth_ for token in parsed_s1]
        entry_dic['s2'] = [token.orth_ for token in parsed_s2]
        entry_dic['s1_lm'] = [token.lemma_ for token in parsed_s1]
        entry_dic['s2_lm'] = [token.lemma_ for token in parsed_s2]
        entry_dic['s1_dep_lm'] = dep_lemmas(parsed_s1) # for dep lemma features.
        entry_dic['s2_dep_lm'] = dep_lemmas(parsed_s2)
        entry_dic['s1_dep_tk'] = dep_tokens(parsed_s1) # for dep token features.
        entry_dic['s2_dep_tk'] = dep_tokens(parsed_s2)  
        entry_dic['s1_root'] = get_root(parsed_s1)
        entry_dic['s2_root'] = get_root(parsed_s2)
        entry_dic['s1_id'] = df.ix[i]['#1 ID'] # for error analysis later.
        entry_dic['s2_id'] = df.ix[i]['#2 ID']
        X_dic[i] = entry_dic
        Y_dic[i] = df.ix[i]['﻿Quality']
    
    return X_dic, Y_dic


In [225]:
%%time
X_train, Y_train = parse_msr(df_train, df_train.index)

CPU times: user 14.9 s, sys: 65.3 ms, total: 15 s
Wall time: 15 s


In [226]:
%%time
X_test, Y_test = parse_msr(df_test, df_test.index)

CPU times: user 6.59 s, sys: 17.5 ms, total: 6.61 s
Wall time: 6.62 s


In [227]:
print 'sentence 1: ', X_train[0]['s1']; print
print 'sentence 2: ', X_train[0]['s2']; print
print 'paraphrase label: ', Y_train[0]

sentence 1:  [u'Amrozi', u'accused', u'his', u'brother', u',', u'whom', u'he', u'called', u'the', u'witness', u',', u'of', u'deliberately', u'distorting', u'his', u'evidence']

sentence 2:  [u'Referring', u'to', u'him', u'as', u'only', u'the', u'witness', u',', u'Amrozi', u'accused', u'his', u'brother', u'of', u'deliberately', u'distorting', u'his', u'evidence']

paraphrase label:  1


##### Load Brown

In [20]:
def parse_brown():
    
    sents = brown.sents()
    parsed_sents = [parser(' '.join(sent)) for sent in sents]
    lemma_words = [token.lemma_ for parsed_sent in parsed_sents for token in parsed_sent]
    
    return lemma_words

In [21]:
%%time
brown_words = parse_brown()

CPU times: user 1min 25s, sys: 469 ms, total: 1min 25s
Wall time: 1min 25s


In [22]:
N = len(brown_words)
N

1188973

## II. Discriminativity Weighting (IDF)

**Math**

* $IDF(w) = \log\frac{N}{df_w}$, where $N$ is the total number of (lemmatized) word tokens in the corpus and $df_w$ is the number of times word $w$ occurs in it.

In [23]:
from __future__ import division

In [24]:
log = lambda x: np.log(x) if x>0 else 0 
    # guard against log(0): div() below returns 0 when w never occurs in Brown
    #  (word_count(w) = 0), so unseen words get idf(w) = 0 rather than -inf.
    #  For seen words N >= word_count(w), so idf(w) is never negative.
div = lambda x,y: x/y if y!=0 else 0

In [25]:
def idf(w):
    
    return log(div(N,brown_words.count(w)))

In [26]:
print "'the': ", idf('the')
print "'discriminate': ", idf('discriminate')

'the':  2.83230709
'discriminate':  12.0426903182
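
The `idf` function above calls `brown_words.count(w)`, which rescans the whole word list on every lookup; this is likely a major contributor to the long featurization times reported later. A minimal sketch of the same weighting with counts precomputed once is given below. The names `toy_words`, `toy_counts` and `idf_fast` are hypothetical, and the tiny word list merely stands in for the lemmatized Brown corpus.

```python
from __future__ import division, print_function
from collections import Counter
import math

# Toy stand-in for the lemmatized Brown word list (hypothetical data).
toy_words = ['the', 'cat', 'sit', 'on', 'the', 'mat', 'the', 'dog']
toy_N = len(toy_words)
toy_counts = Counter(toy_words)  # a single pass over the corpus


def idf_fast(w, counts=toy_counts, n=toy_N):
    """Return log(n / count(w)), or 0 for unseen words (same guard as log/div above)."""
    c = counts.get(w, 0)
    return math.log(n / c) if c > 0 else 0.0


print(idf_fast('the'))    # frequent word, low weight
print(idf_fast('mat'))    # rare word, higher weight
print(idf_fast('zebra'))  # unseen word, weight 0
```

Each lookup is then a dictionary access rather than a scan of more than a million items.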


## IIIa. Features

### A.  Unigram Prec/Rec

**Math**

* $Uni\_Prec(s_1,s_2) = \frac{word\_overlap(s_1,s_2)\cdot \left(\sum_{w\in s_1\cap s_2}\log\frac{N}{df_w}\right)}{word\_count(s_1)}$ (cf. Wan et al. 2006:133, weighted by $IDF$)


* $Uni\_Rec(s_1,s_2) = \frac{word\_overlap(s_1,s_2)\cdot \left(\sum_{w\in s_1\cap s_2}\log\frac{N}{df_w}\right)}{word\_count(s_2)}$ (cf. ibid.)

In [179]:
intersection = lambda s1,s2: set(s1).intersection(set(s2))
word_overlap = lambda s1,s2: len(intersection(s1,s2))
lemmatize = lambda s: [token.lemma_ for token in parser(' '.join(s))]

In [180]:
def uni_prec(s1, s2): # s1,s2 assumed to be lists of words (lemmas or tokens)

    return div(word_overlap(s1,s2) * \
               sum(idf(w) for w in intersection(s1,s2)),
               len(s1))


In [181]:
def uni_rec(s1, s2):   
    
    return div(word_overlap(s1,s2) * \
               sum(idf(w) for w in intersection(s1,s2)),
               len(s2))


In [182]:
s0 = X_train[0]['s1']
s1 = X_train[0]['s2'] # labeled as a paraphrase of s0
s2 = X_train[1]['s1'] # taken from a different pair, so not a paraphrase of s0
s0_lm = X_train[0]['s1_lm']
s1_lm = X_train[0]['s2_lm']
s2_lm = X_train[1]['s1_lm']

In [183]:
%%time
print uni_prec(s0,s1)
print uni_prec(s0,s2)
print uni_prec(s0_lm,s1_lm)
print uni_prec(s0_lm,s2_lm)

36.1294834287
0.177019193125
50.9323766953
0.177019193125
CPU times: user 620 ms, sys: 1.21 ms, total: 622 ms
Wall time: 622 ms


In [184]:
%%time
print uni_rec(s0,s1)
print uni_rec(s0,s2)
print uni_rec(s0_lm,s1_lm)
print uni_rec(s0_lm,s2_lm)

34.0042196976
0.177019193125
47.9363545368
0.177019193125
CPU times: user 632 ms, sys: 1.13 ms, total: 634 ms
Wall time: 634 ms
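
As a worked instance of the two formulas in this section, the sketch below recomputes them on a toy pair. The IDF table `toy_idf` and the helper names are hypothetical and chosen only so that the arithmetic is easy to follow. Because the overlap count multiplies the IDF sum, the scores are not bounded by 1, which matches the magnitudes printed above.

```python
from __future__ import division, print_function

# Hypothetical IDF weights, chosen only for illustration.
toy_idf = {'amrozi': 3.0, 'accuse': 2.0, 'brother': 2.0,
           'the': 0.5, 'witness': 2.5}


def weighted_overlap(s1, s2, idf_table):
    """Overlap count times the summed IDF of the shared word types."""
    shared = set(s1) & set(s2)
    return len(shared) * sum(idf_table.get(w, 0.0) for w in shared)


def uni_prec_rec(s1, s2, idf_table):
    """Weighted overlap normalised by len(s1) (precision) and len(s2) (recall)."""
    num = weighted_overlap(s1, s2, idf_table)
    prec = num / len(s1) if s1 else 0.0
    rec = num / len(s2) if s2 else 0.0
    return prec, rec


a = ['amrozi', 'accuse', 'the', 'brother']
b = ['the', 'witness', 'accuse', 'the', 'brother', 'amrozi']
prec, rec = uni_prec_rec(a, b, toy_idf)
print(prec)  # 4 shared types * 7.5 summed IDF = 30; 30 / 4 = 7.5
print(rec)   # 30 / 6 = 5.0
```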


### B. BLEU Prec/Rec

**NB** (cf. Wan et al. 2006:133)

* "... Bleu metric uses the geometric average of unigram, bigram and trigram precision scores."
* "... by reversing [two sentences], ... a recall version of Bleu is obtained."

In [185]:
from nltk import bleu

In [186]:
def bleu_prec(s1, s2, lemmatized=False): # s1 as the 'hypothesis'

    return bleu(s2,s1)

In [187]:
def bleu_rec(s1, s2, lemmatized=False): # s2 as the 'hypothesis' 
    
    return bleu(s1,s2)

In [188]:
%%time
print bleu_prec(s0,s1)
print bleu_prec(s0,s2)
print bleu_prec(s0_lm,s1_lm)
print bleu_prec(s0_lm,s2_lm)

0.5
0
0.5
0
CPU times: user 7.14 ms, sys: 2.93 ms, total: 10.1 ms
Wall time: 7.61 ms


In [189]:
%%time
print bleu_rec(s0,s1)
print bleu_rec(s0,s2)
print bleu_rec(s0_lm,s1_lm)
print bleu_rec(s0_lm,s2_lm)

0.492479060505
0
0.492479060505
0
CPU times: user 6.58 ms, sys: 2.44 ms, total: 9.01 ms
Wall time: 7.4 ms
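
The description quoted at the start of this section (a geometric average of unigram, bigram and trigram precision, with a recall variant obtained by swapping the two sentences) can be sketched in plain Python as follows. This is a simplified illustration and not NLTK's implementation: it assumes clipped n-gram counts, no brevity penalty and no smoothing, and the function names are hypothetical.

```python
from __future__ import division, print_function
from collections import Counter
import math


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hyp, ref, n):
    """Share of hypothesis n-grams also found in the reference (counts clipped)."""
    hyp_counts, ref_counts = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total


def geo_mean_precision(hyp, ref, max_n=3):
    """Geometric mean of the 1..max_n-gram precisions; 0 if any of them is 0."""
    precs = [clipped_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precs) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precs) / max_n)


hyp = ['he', 'accused', 'his', 'brother', 'of', 'lying']
ref = ['he', 'accused', 'his', 'brother', 'of', 'distorting', 'evidence']
prec_like = geo_mean_precision(hyp, ref)  # hyp scored against ref
rec_like = geo_mean_precision(ref, hyp)   # roles reversed: the recall variant
print(prec_like, rec_like)
```

Here the n-gram precisions are 5/6, 4/5 and 3/4 in one direction and 5/7, 4/6 and 3/5 in the other, so the two scores differ even though the matched n-grams are the same.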


### C. Dependency Prec/Rec

**Math**

* $Dep\_Prec(s_1,s_2) = \frac{|dep\_pair(s_1)\cap dep\_pair(s_2)|}{|dep\_pair(s_1)|}$ (cf. Wan et al. 2006:134)


* $Dep\_Rec(s_1,s_2) = \frac{|dep\_pair(s_1)\cap dep\_pair(s_2)|}{|dep\_pair(s_2)|}$ (cf. ibid.)

**NB**: The reference uses $relation$ for what is called a *dependency pair* here, which should not be confused with a labeled *dependency relation*. $relation$ refers to "... a pair of words in a parent-child relationship within the dependency tree, referred to as head-modifier relationship. ... we ignore the label of the relationships which indicates the semantic role".

In [190]:
def dep_prec(dep_pairs_s1, dep_pairs_s2):
    
    return div(len(dep_pairs_s1.intersection(dep_pairs_s2)),
               len(dep_pairs_s1))
        

In [191]:
def dep_rec(dep_pairs_s1, dep_pairs_s2):
    
    return div(len(dep_pairs_s1.intersection(dep_pairs_s2)),
               len(dep_pairs_s2))


In [192]:
s0_dep_tk = X_train[0]['s1_dep_tk']
s1_dep_tk = X_train[0]['s2_dep_tk']
s2_dep_tk = X_train[1]['s1_dep_tk']
s0_dep_lm = X_train[0]['s1_dep_lm']
s1_dep_lm = X_train[0]['s2_dep_lm']
s2_dep_lm = X_train[1]['s1_dep_lm']

In [193]:
%%time
print dep_prec(s0_dep_tk,s1_dep_tk)
print dep_prec(s0_dep_tk,s2_dep_tk)
print dep_prec(s0_dep_lm,s1_dep_lm)
print dep_prec(s0_dep_lm,s2_dep_lm)

0.571428571429
0.0
0.571428571429
0.0
CPU times: user 278 µs, sys: 125 µs, total: 403 µs
Wall time: 287 µs


In [194]:
%%time
print dep_rec(s0_dep_tk,s1_dep_tk)
print dep_rec(s0_dep_tk,s2_dep_tk)
print dep_rec(s0_dep_lm,s1_dep_lm)
print dep_rec(s0_dep_lm,s2_dep_lm)

0.5
0.0
0.5
0.0
CPU times: user 419 µs, sys: 348 µs, total: 767 µs
Wall time: 496 µs
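
To make the set arithmetic behind the dependency scores concrete, the sketch below applies the same precision and recall definitions to two hand-written sets of (head, modifier) pairs. The pairs are hypothetical and were not produced by the parser; `pair_prec_rec` is an illustrative helper, not part of the notebook.

```python
from __future__ import division, print_function

# Hypothetical (head, modifier) pairs, written by hand rather than parsed.
pairs_a = {('accused', 'Amrozi'), ('accused', 'brother'),
           ('brother', 'his'), ('accused', 'of')}
pairs_b = {('accused', 'Amrozi'), ('accused', 'brother'),
           ('brother', 'his'), ('accused', 'Referring'),
           ('Referring', 'to')}


def pair_prec_rec(p1, p2):
    """Shared pairs over |p1| (precision) and over |p2| (recall)."""
    shared = len(p1 & p2)
    prec = shared / len(p1) if p1 else 0.0
    rec = shared / len(p2) if p2 else 0.0
    return prec, rec


prec, rec = pair_prec_rec(pairs_a, pairs_b)
print(prec, rec)  # 3 shared pairs: 3/4 and 3/5
```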


### D. F1

**Math**

* $F1 = 2\cdot\frac{prec\cdot rec}{prec + rec}$ (cf. https://en.wikipedia.org/wiki/F1_score)
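
As a quick check of the formula, the sketch below evaluates it for an example precision of 4/7 and recall of 1/2, and shows the zero guard that the `div` helper provides in the notebook's own functions. The name `f1_score` is hypothetical and used only for this illustration.

```python
from __future__ import division, print_function


def f1_score(prec, rec):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    total = prec + rec
    return 2 * prec * rec / total if total != 0 else 0.0


print(f1_score(4 / 7, 1 / 2))  # 8/15, about 0.5333
print(f1_score(0.0, 0.0))      # guarded: 0.0 instead of a ZeroDivisionError
```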

In [195]:
def f1_unigram(s1, s2):
    
    prec, rec = uni_prec(s1,s2), uni_rec(s1,s2)
    
    return prec, rec, 2*div(prec*rec,prec+rec) # so later we only do uni_prec/rec once!

In [196]:
def f1_bleu(s1, s2):
    
    prec, rec = bleu_prec(s1,s2), bleu_rec(s1,s2)
    
    return prec, rec, 2*div(prec*rec,prec+rec)

In [197]:
def f1_dep(dep_pairs_s1, dep_pairs_s2):
   
    prec, rec = dep_prec(dep_pairs_s1,dep_pairs_s2), \
                dep_rec(dep_pairs_s1,dep_pairs_s2)
    
    return prec, rec, 2*div(prec*rec,prec+rec)

In [198]:
%%time
print f1_unigram(s0,s1)[2]
print f1_unigram(s0,s2)[2]
print f1_unigram(s0_lm,s1_lm)[2]
print f1_unigram(s0_lm,s2_lm)[2]

35.0346505975
0.177019193125
49.3889713409
0.177019193125
CPU times: user 1.3 s, sys: 2.82 ms, total: 1.3 s
Wall time: 1.31 s


In [199]:
%%time
print f1_bleu(s0,s1)[2]
print f1_bleu(s0,s2)[2]
print f1_bleu(s0_lm,s1_lm)[2]
print f1_bleu(s0_lm,s2_lm)[2]

0.496211033666
0
0.496211033666
0
CPU times: user 13.9 ms, sys: 4.99 ms, total: 18.9 ms
Wall time: 15.1 ms


In [200]:
%%time
print f1_dep(s0_dep_tk,s1_dep_tk)[2]
print f1_dep(s0_dep_tk,s2_dep_tk)[2]
print f1_dep(s0_dep_lm,s1_dep_lm)[2]
print f1_dep(s0_dep_lm,s2_dep_lm)[2]

0.533333333333
0
0.533333333333
0
CPU times: user 219 µs, sys: 134 µs, total: 353 µs
Wall time: 260 µs


### E. Tree Edit Distance

In [248]:
from zss import simple_distance, Node
    # use zss.distance if dynamic tree modification is needed. 
    #  cf. zss api: pythonhosted.org/zss/.

In [249]:
def make_tree(token, lemmatized):
    
    node = Node(token.lemma_) if lemmatized else Node(token.orth_)
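    children = list(token.children) # get_children is not defined in this notebook; spaCy's token.children is assumed to be what was meant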
    if len(children)==0: return node
    for child in children:
        node.addkid(make_tree(child, lemmatized))
    
    return node
    

In [250]:
def tree_edit_dist(root_s1, root_s2, lemmatized=False):
    
    return simple_distance(make_tree(root_s1, lemmatized),
                           make_tree(root_s2, lemmatized))
    

In [251]:
s0_root = X_train[0]['s1_root']
s1_root = X_train[0]['s2_root']
s2_root = X_train[1]['s1_root']

In [252]:
%%time
print tree_edit_dist(s0_root,s1_root)
print tree_edit_dist(s0_root,s2_root)

13
21
CPU times: user 18.2 ms, sys: 8.67 ms, total: 26.9 ms
Wall time: 20 ms


In [254]:
%%time
print tree_edit_dist(s0_root,s1_root,lemmatized=True)
print tree_edit_dist(s0_root,s2_root,lemmatized=True)

13
21
CPU times: user 17.1 ms, sys: 7.2 ms, total: 24.3 ms
Wall time: 18.9 ms
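
For readers without `zss` at hand, the quantity being measured can be illustrated with a naive memoized recursion over ordered forests. This is only a sketch under stated assumptions: unit cost for insert, delete and relabel, trees written as hypothetical `(label, children)` tuples, and none of the optimizations of the Zhang & Shasha algorithm, so it is suitable for small trees only.

```python
from __future__ import print_function


def tree_size(tree):
    """Number of nodes in a (label, children) tuple tree."""
    _, kids = tree
    return 1 + sum(tree_size(k) for k in kids)


def forest_dist(f, g, memo=None):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if memo is None:
        memo = {}
    key = (f, g)
    if key not in memo:
        if not f or not g:
            # One side is empty: only insertions or deletions remain.
            memo[key] = sum(tree_size(t) for t in f + g)
        else:
            (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
            memo[key] = min(
                forest_dist(f[:-1] + v_kids, g, memo) + 1,  # delete root of v
                forest_dist(f, g[:-1] + w_kids, memo) + 1,  # insert root of w
                forest_dist(v_kids, w_kids, memo)           # match v with w
                + forest_dist(f[:-1], g[:-1], memo)
                + (v_label != w_label),
            )
    return memo[key]


def tree_dist(a, b):
    """Edit distance between two single trees."""
    return forest_dist((a,), (b,))


t1 = ('accused', (('Amrozi', ()), ('brother', (('his', ()),))))
t2 = ('accused', (('Amrozi', ()), ('sibling', (('his', ()),))))
t3 = ('accused', (('Amrozi', ()),))
print(tree_dist(t1, t2))  # one relabel: brother -> sibling
print(tree_dist(t1, t3))  # two deletions: brother and his
```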


### F. Sentence Lengths

* "... the difference in length of two sentences ... measured in words by subtracting one length from the other." (cf. Wan et al. 2006:134)
* "... this difference could be a negative or positive integer ... an absolute variant was used." (cf. ibid.)

In [55]:
def sent_len_diffs(s1, s2):
    
    diff = len(s1)-len(s2)
    
    return [diff, abs(diff)]

In [56]:
print sent_len_diffs(s0,s1)
print sent_len_diffs(s0,s2)

[-1, 1]
[0, 0]


## IIIb. Featurization Function

**Features (22 in total)**:

* Unigram Prec/Rec + lemmatized variant: 4
* Bleu Prec/Rec + lemmatized variant: 4
* Dependency Prec/Rec + lemmatized variant: 4
* F1 Unigram, Bleu, Dependency + lemmatized variant: 6
* Tree Edit Distance + lemmatized variant: 2
* Sentence Lengths: 2
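
Since the feature vectors are plain lists, it can help to keep a parallel list of names when inspecting them. The sketch below builds such a list in the order in which the notebook's featurization functions assemble their return values. `FEATURE_NAMES` and `label_features` are hypothetical helpers introduced here, not part of the original code.

```python
from __future__ import print_function

# Hypothetical names, following the order of the featurization return lists.
METRICS = ['uni', 'bleu', 'dep']
FEATURE_NAMES = ['%s_%s_%s' % (metric, variant, score)
                 for metric in METRICS
                 for variant in ('tk', 'lm')          # token vs. lemma variant
                 for score in ('prec', 'rec', 'f1')]
FEATURE_NAMES += ['tree_tk_dist', 'tree_lm_dist', 'diff', 'abs_diff']


def label_features(vector, names=FEATURE_NAMES):
    """Pair each feature value with its name; reject vectors of the wrong length."""
    if len(vector) != len(names):
        raise ValueError('expected %d features, got %d' % (len(names), len(vector)))
    return dict(zip(names, vector))


print(len(FEATURE_NAMES))  # 18 precision/recall/F1 features + 4 others
```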

In [57]:
import numpy as np

In [255]:
def featurize_new(s1, s2): 
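    # featurize a new input.
    # s1, s2 are raw sentence strings (unicode), as passed to the parser.
    # NB: the token-level unigram, Bleu and length features below receive these
    #  strings directly and therefore compare characters, not tokens.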

    parsed_s1, parsed_s2 = parser(s1), parser(s2)
    s1_lm = [token.lemma_ for token in parsed_s1]
    s2_lm = [token.lemma_ for token in parsed_s2]
    s1_dep_lm, s2_dep_lm = dep_lemmas(parsed_s1), dep_lemmas(parsed_s2)
    s1_dep_tk, s2_dep_tk = dep_tokens(parsed_s1), dep_tokens(parsed_s2)
    s1_root, s2_root = get_root(parsed_s1), get_root(parsed_s2)
    
    uni_tk_prec, uni_tk_rec, uni_tk_f1 = f1_unigram(s1, s2)
    uni_lm_prec, uni_lm_rec, uni_lm_f1 = f1_unigram(s1_lm, s2_lm)
    bleu_tk_prec, bleu_tk_rec, bleu_tk_f1 = f1_bleu(s1, s2)
    bleu_lm_prec, bleu_lm_rec, bleu_lm_f1 = f1_bleu(s1_lm, s2_lm)
    dep_tk_prec, dep_tk_rec, dep_tk_f1 = f1_dep(s1_dep_tk, s2_dep_tk)
    dep_lm_prec, dep_lm_rec, dep_lm_f1 = f1_dep(s1_dep_lm, s2_dep_lm)
    tree_tk_dist = tree_edit_dist(s1_root, s2_root)
    tree_lm_dist = tree_edit_dist(s1_root, s2_root, lemmatized=True)
    diff, abs_diff = sent_len_diffs(s1, s2)

    return [uni_tk_prec, uni_tk_rec, uni_tk_f1,
            uni_lm_prec, uni_lm_rec, uni_lm_f1,
            bleu_tk_prec, bleu_tk_rec, bleu_tk_f1,
            bleu_lm_prec, bleu_lm_rec, bleu_lm_f1,
            dep_tk_prec, dep_tk_rec, dep_tk_f1,
            dep_lm_prec, dep_lm_rec, dep_lm_f1,
            tree_tk_dist, tree_lm_dist,
            diff, abs_diff]


In [259]:
s0_asnew, s1_asnew, s2_asnew = ' '.join(s0), ' '.join(s1), ' '.join(s2)
print s0_asnew
print s1_asnew
print s2_asnew

Amrozi accused his brother , whom he called the witness , of deliberately distorting his evidence
Referring to him as only the witness , Amrozi accused his brother of deliberately distorting his evidence
Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion


In [261]:
print featurize_new(s0_asnew,s1_asnew)
print 
print featurize_new(s0_asnew,s2_asnew)

[52.231568075980661, 48.252020032096418, 50.162991122476484, 50.932376695342796, 47.936354536793218, 49.388971340938468, 0.7052772528401912, 0.691441569283882, 0.6982908839768287, 0.5, 0.4924790605054523, 0.49621103366618263, 0.5714285714285714, 0.5, 0.5333333333333333, 0.5714285714285714, 0.5, 0.5333333333333333, 13, 13, -8, 8]

[38.373129966155794, 42.783834559966806, 40.45862615996861, 0.17701919312505979, 0.17701919312505979, 0.17701919312505976, 0.6738520677318575, 0.6924328859069189, 0.6830161317245619, 0, 0, 0, 0.0, 0.0, 0, 0.0, 0.0, 0, 21, 21, 10, 10]


## IV. Paraphrase Classifier

### A. Featurizing Training/Test from MSR

* "... the training set contains 2753 true paraphrase pairs and 1323 false paraphrase pairs; ... the test set contains 1147 and 578 pairs, respectively." (cf. Ji & Eisenstein 2013:893)

In [262]:
print X_train[0].keys()

['s1_dep_lm', 's1_lm', 's2_dep_lm', 's2', 's1', 's2_root', 's1_dep_tk', 's1_root', 's1_id', 's2_dep_tk', 's2_id', 's2_lm']


In [263]:
def featurize_set(X, Y):
    
    X_list, Y_list = [], []
    
    for i in xrange(len(X)):

        uni_tk_prec, uni_tk_rec, uni_tk_f1 = f1_unigram(X[i]['s1'], X[i]['s2'])
        uni_lm_prec, uni_lm_rec, uni_lm_f1 = f1_unigram(X[i]['s1_lm'], X[i]['s2_lm'])
        bleu_tk_prec, bleu_tk_rec, bleu_tk_f1 = f1_bleu(X[i]['s1'], X[i]['s2'])
        bleu_lm_prec, bleu_lm_rec, bleu_lm_f1 = f1_bleu(X[i]['s1_lm'], X[i]['s2_lm'])
        dep_tk_prec, dep_tk_rec, dep_tk_f1 = f1_dep(X[i]['s1_dep_tk'], X[i]['s2_dep_tk'])
        dep_lm_prec, dep_lm_rec, dep_lm_f1 = f1_dep(X[i]['s1_dep_lm'], X[i]['s2_dep_lm'])
        tree_tk_dist = tree_edit_dist(X[i]['s1_root'], X[i]['s2_root'])
        tree_lm_dist = tree_edit_dist(X[i]['s1_root'], X[i]['s2_root'],lemmatized=True) 
        diff, abs_diff = sent_len_diffs(X[i]['s1'], X[i]['s2'])
        X_list.append(
            [uni_tk_prec, uni_tk_rec, uni_tk_f1,
             uni_lm_prec, uni_lm_rec, uni_lm_f1,
             bleu_tk_prec, bleu_tk_rec, bleu_tk_f1,
             bleu_lm_prec, bleu_lm_rec, bleu_lm_f1,
             dep_tk_prec, dep_tk_rec, dep_tk_f1,
             dep_lm_prec, dep_lm_rec, dep_lm_f1,
             tree_tk_dist, tree_lm_dist,
             diff, abs_diff]
        )
        Y_list.append(Y[i])
    
    return X_list, Y_list

In [268]:
%%time
X_train_fts, Y_train_fts = featurize_set(X_train, Y_train)

CPU times: user 1h 55min 21s, sys: 35 s, total: 1h 55min 56s
Wall time: 1h 56min 57s


In [269]:
%%time
X_test_fts, Y_test_fts = featurize_set(X_test, Y_test)

CPU times: user 38min 25s, sys: 5.13 s, total: 38min 30s
Wall time: 38min 30s


In [270]:
import cPickle

In [271]:
data_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/DATA/"

In [272]:
# with open(data_path+'train.p','wb') as f_train:
#     cPickle.dump((X_train_fts,Y_train_fts), f_train)
# with open(data_path+'test.p','wb') as f_test:
#     cPickle.dump((X_test_fts,Y_test_fts), f_test)

In [286]:
print X_train_fts[0]

[36.129483428692055, 34.004219697592525, 35.034650597519565, 50.932376695342796, 47.936354536793218, 49.388971340938468, 0.5, 0.4924790605054523, 0.49621103366618263, 0.5, 0.4924790605054523, 0.49621103366618263, 0.5714285714285714, 0.5, 0.5333333333333333, 0.5714285714285714, 0.5, 0.5333333333333333, 13, 13, -1, 1]


### B. Logistic Regression

In [282]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [275]:
lr = LogisticRegression()

In [276]:
lr.fit(X_train_fts, Y_train_fts)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [277]:
y_true = Y_test_fts
y_pred = lr.predict(X_test_fts)

In [285]:
print (accuracy_score(y_true,y_pred)*100)

73.3913043478


In [284]:
print classification_report(y_true,y_pred)

             precision    recall  f1-score   support

          0       0.63      0.51      0.56       578
          1       0.77      0.85      0.81      1147

avg / total       0.72      0.73      0.73      1725
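
To put the accuracy figures in context, a majority-class baseline can be computed from the class counts quoted in Section IV.A (which sum to the 4076 training and 1725 test rows loaded earlier). The sketch below is simple arithmetic on those quoted counts; `majority_baseline` is a hypothetical helper.

```python
from __future__ import division, print_function

# Class counts as quoted from Ji & Eisenstein (2013) in Section IV.A.
train_pos, train_neg = 2753, 1323
test_pos, test_neg = 1147, 578


def majority_baseline(pos, neg):
    """Accuracy obtained by always predicting the more frequent class."""
    return max(pos, neg) / (pos + neg)


print('train baseline: %.4f' % majority_baseline(train_pos, train_neg))
print('test baseline:  %.4f' % majority_baseline(test_pos, test_neg))
```

Always predicting "paraphrase" would therefore already be right on roughly two thirds of the test pairs, which is the figure a classifier needs to beat.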



In [None]:
# TODO: top5 accuracy

In [297]:
lr.predict([X_test_fts[0]])

array([1])

In [298]:
lr.predict_proba([X_test_fts[0]])

array([[ 0.43530027,  0.56469973]])

### C. SVM

In [287]:
from sklearn import svm

In [300]:
clf = svm.SVC(probability=True)

In [301]:
clf.fit(X_train_fts, Y_train_fts)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [302]:
y_true = Y_test_fts
y_pred = clf.predict(X_test_fts)

In [303]:
print (accuracy_score(y_true,y_pred)*100)

69.1594202899


In [304]:
print classification_report(y_true,y_pred)

             precision    recall  f1-score   support

          0       0.58      0.29      0.39       578
          1       0.71      0.89      0.79      1147

avg / total       0.67      0.69      0.66      1725



In [None]:
# TODO: top5 accuracy

In [306]:
clf.predict([X_test_fts[0]])

array([1])

In [307]:
clf.predict_proba([X_test_fts[0]])

array([[ 0.28224807,  0.71775193]])