# Sentence Similarity Measures IIIb: SOA Wide-Inclusive Sentence Featurization

## 0. Contents

* I. Corpora
    * MSR Paraphrase Corpus (Dolan et al. 2004 (D04))
* II. Sent-Feature Matrix (Guo & Diab 2012 (GD12); Ji & Eisenstein 2013 (JE13))
* III. TF-KLD Weigthing (JE13)
* IV. NMF Reduction (Lee & Seung 2003 (LS03))
* V. Final Featurization (Wan et al. 2006 (W06); JE13)
* VI. Evaluation 
    * Logistic Regression
    * SVM
    * Customized Paraphrase Searching

## I. Corpora

In [1]:
from nltk import bigrams
import numpy as np
import pandas as pd
from spacy.en import English
from collections import defaultdict

In [2]:
parser = English()

##### Load MSR

In [3]:
train_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/CORPORA/paraphrase/msr_paraphrase_train.txt"
test_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/CORPORA/paraphrase/msr_paraphrase_test.txt"

In [4]:
df_train = pd.read_table(train_path, encoding='utf-8-sig')
df_test = pd.read_table(test_path, encoding='utf-8-sig')
df_train.head()

Unnamed: 0,Quality,#1 ID,#2 ID,#1 String,#2 String
0,1,702876,702977,"Amrozi accused his brother, whom he called the...","Referring to him as only the witness, Amrozi a..."
1,0,2108705,2108831,Yucaipa owned Dominick's before selling the ch...,Yucaipa bought Dominick's in 1995 for $693 mil...
2,1,1330381,1330521,They had published an advertisement on the Int...,"On June 10, the ship's owners had published an..."
3,0,3344667,3344648,"Around 0335 GMT, Tab shares were up 19 cents, ...","Tab shares jumped 20 cents, or 4.6%, to set a ..."
4,1,1236820,1236712,"The stock rose $2.11, or about 11 percent, to ...",PG&E Corp. shares jumped $1.63 or 8 percent to...


In [5]:
print df_train.shape
print df_test.shape

(4076, 5)
(1725, 5)


In [6]:
df_train.ix[0] # NB: index Quality is actually weirdly '﻿Quality'.

Quality                                                      1
#1 ID                                                   702876
#2 ID                                                   702977
#1 String    Amrozi accused his brother, whom he called the...
#2 String    Referring to him as only the witness, Amrozi a...
Name: 0, dtype: object

In [7]:
df_train.ix[0]['#1 String']

u'Amrozi accused his brother, whom he called the witness, of deliberately distorting his evidence.'

##### Make Train/Test

In [8]:
def dep_pairs(parsed_s):
    return [(token.head.lemma_,'rel',token.lemma_) 
            for token in parsed_s if token.head.lemma_!=token.lemma_] # differentiate dep-pairs from bigrams

In [9]:
def parse_msr(df, indexer):
    
    X_dic, Y_dic = defaultdict(lambda x: defaultdict(list)), \
                   defaultdict(lambda x: defaultdict(list))

    for i in indexer:
        
        entry_dic = defaultdict(list)
        s1, s2 = df.ix[i]['#1 String'][:-1], \
                 df.ix[i]['#2 String'][:-1] 
                # get rid of period, which causes problem in distinguishing identical tokens.
        
        parsed_s1, parsed_s2 = parser(unicode(s1)), parser(unicode(s2))

        entry_dic['s1'] = [token.lemma_ for token in parsed_s1]
        entry_dic['s2'] = [token.lemma_ for token in parsed_s2]
        entry_dic['s1_ft'] = entry_dic['s1'] + \
                             [bigram for bigram in bigrams(entry_dic['s1'])] + \
                             dep_pairs(parsed_s1)
        entry_dic['s2_ft'] = entry_dic['s2'] + \
                             [bigram for bigram in bigrams(entry_dic['s2'])] + \
                             dep_pairs(parsed_s2) 
        entry_dic['s1_id'] = df.ix[i]['#1 ID'] # for error analysis later.
        entry_dic['s2_id'] = df.ix[i]['#2 ID']
        
        X_dic[i] = entry_dic
        Y_dic[i] = df.ix[i]['Quality']
    
    return X_dic, Y_dic


In [10]:
%%time
X_train, Y_train = parse_msr(df_train, df_train.index)

CPU times: user 18.4 s, sys: 183 ms, total: 18.6 s
Wall time: 18.7 s


In [11]:
%%time
X_test, Y_test = parse_msr(df_test, df_test.index)

CPU times: user 7.87 s, sys: 89.1 ms, total: 7.95 s
Wall time: 8.01 s


In [12]:
sample = X_train[0]

In [13]:
print sample['s1']; print
print sample['s1_ft']

[u'amrozi', u'accuse', u'his', u'brother', u',', u'whom', u'he', u'call', u'the', u'witness', u',', u'of', u'deliberately', u'distort', u'his', u'evidence']

[u'amrozi', u'accuse', u'his', u'brother', u',', u'whom', u'he', u'call', u'the', u'witness', u',', u'of', u'deliberately', u'distort', u'his', u'evidence', (u'amrozi', u'accuse'), (u'accuse', u'his'), (u'his', u'brother'), (u'brother', u','), (u',', u'whom'), (u'whom', u'he'), (u'he', u'call'), (u'call', u'the'), (u'the', u'witness'), (u'witness', u','), (u',', u'of'), (u'of', u'deliberately'), (u'deliberately', u'distort'), (u'distort', u'his'), (u'his', u'evidence'), (u'accuse', 'rel', u'amrozi'), (u'brother', 'rel', u'his'), (u'accuse', 'rel', u'brother'), (u'brother', 'rel', u','), (u'call', 'rel', u'whom'), (u'call', 'rel', u'he'), (u'brother', 'rel', u'call'), (u'witness', 'rel', u'the'), (u'call', 'rel', u'witness'), (u'brother', 'rel', u','), (u'accuse', 'rel', u'of'), (u'distort', 'rel', u'deliberately'), (u'of', 'rel', 

##### Get Sent/Unigram/Bigram/Dep-pair Dictionary

In [14]:
def indexing(train, test):
    
    idx2id, idx2ft = set(), set()
    
    for x in train.values(): # we have to keep the two sets separate
        idx2id.update([x['s1_id'],x['s2_id']])
        idx2ft.update(x['s1_ft']+x['s2_ft'])
    for x in test.values():
        idx2id.update([x['s1_id'],x['s2_id']])
        idx2ft.update(x['s1_ft']+x['s2_ft']) 
    
    idx2id, idx2ft = list(idx2id),list(idx2ft)
    
    id2idx = {i:idx for idx,i in enumerate(idx2id)}
    ft2idx = {ft:idx for idx,ft in enumerate(idx2ft)}
        
    return {'idx2id':idx2id, 'idx2ft':idx2ft,
            'id2idx':id2idx, 'ft2idx':ft2idx}


In [15]:
dicts = indexing(X_train, X_test)

In [16]:
nrows = len(dicts['idx2id'])
ncols = len(dicts['idx2ft'])
print '# rows: ', nrows
print '# cols: ', ncols

# rows:  10948
# cols:  191497


## II. Sent-Feature Matrix

In [65]:
def s2f_matrix(dicts, train, test):
    
    s2f = np.zeros((nrows,ncols))
    idx2id, idx2ft = dicts['idx2id'], dicts['idx2ft'] # easier to refer to.
    id2idx, ft2idx = dicts['id2idx'], dicts['ft2idx']
    
    for x in train.values():
        r_idx1, r_idx2 = id2idx[x['s1_id']], id2idx[x['s2_id']]
        for ft in x['s1_ft']+x['s1_ft']:
            s2f[r_idx1][ft2idx[ft]] += 1
        for ft in x['s2_ft']+x['s2_ft']:
            s2f[r_idx2][ft2idx[ft]] += 1        
    for x in test.values():
        r_idx1, r_idx2 = id2idx[x['s1_id']], id2idx[x['s2_id']]
        for ft in x['s1_ft']+x['s1_ft']:
            s2f[r_idx1][ft2idx[ft]] += 1
        for ft in x['s2_ft']+x['s2_ft']:
            s2f[r_idx2][ft2idx[ft]] += 1 
    
    return s2f


In [66]:
%%time
s2f = s2f_matrix(dicts, X_train, X_test)

CPU times: user 1.79 s, sys: 831 ms, total: 2.62 s
Wall time: 2.7 s


In [67]:
s2f.shape

(10948, 191497)

##### To Sparse

In [68]:
from scipy.sparse import csr_matrix, csc_matrix

In [69]:
%%time
s2f_csr = csr_matrix(s2f)

CPU times: user 25.2 s, sys: 13.4 s, total: 38.7 s
Wall time: 43.4 s


In [70]:
%%time
s2f_csc = csc_matrix(s2f)

CPU times: user 24.6 s, sys: 14.7 s, total: 39.3 s
Wall time: 45.2 s


In [71]:
print s2f_csr.shape
print s2f_csc.shape

(10948, 191497)
(10948, 191497)


## III. TF-KLD Weighting

### A. Compute KLD

**NB**: Only computed on the training set, to avoid test leak.

**Math** (cf. JE13:892)

* $p_k = P(w_{ik}^{(1)}|w_{ik}^{(2)},r_i=1)$, probability that sentence $w_i^{(1)}$ contains features $k$, given that $k$ appears in $w_i^{(2)}$ and the two sentences are labeled as paraphrases, $r_i=1$.
* $q_k = P(w_{ik}^{(1)}|w_{ik}^{(2)},r_i=0)$, probability that sentence $w_i^{(1)}$ contains features $k$, given that $k$ appears in $w_i^{(2)}$ and the two sentences are labeled as not paraphrases, $r_i=0$.
* **NB**: $w_i^{(1)}$ and $w_i^{(2)}$ are binarized vectors of distributional features for sentence $1$ and $2$ in a pair, respectively. The order of the sentences within the pair is assumed irrelevant.

In [72]:
from __future__ import division

In [73]:
def div(x, y):
    return x/y if y!=0 else 0

In [122]:
def kl(p, q):
    # pi==0: lim(0*log0)=0
    # qi==0: actually inf, but since we have binary ps and qs,
    #       and the kl in the extreme case of [.99999,.00001] vs. [.00001,.99999]
    #       is equal to ~11.51, i use 10 as a "big-kl-divergence" penalty.
    
    d = 0
    for pi,qi in zip(p,q):
        if pi==0: continue
        elif qi==0: 
            d += 10
            continue
        d += pi*np.log(pi/qi)
    
    return d

In [75]:
def kl2(a, b):
    
    a = np.asarray(a, dtype=np.float)
    b = np.asarray(b, dtype=np.float)

    return np.sum(np.where(a != 0, a * np.log(a / b + 1e-7), 0))

In [96]:
def kl3(p, q):
    
    d = 0
    for pi,qi in zip(p,q):
        if pi==0: continue
        elif qi==0: 
            d += 0
            continue
        d += pi*np.log(pi/qi)
    
    return d

In [118]:
def get_kld(dicts, X_train, Y_train): 

    num_p, denom_p = defaultdict(int), defaultdict(int) # r = 1 condition
    num_q, denom_q = defaultdict(int), defaultdict(int) # r = 0 condition
    fts = dicts['ft2idx'].keys()
    
    for i,x in X_train.iteritems():
        r = Y_train[i]
        if r==1:
            for ft in x['s2_ft']:
                denom_p[ft] += 1
                if ft in x['s1_ft']:
                    num_p[ft] += 1
        else:
            for ft in x['s2_ft']:
                denom_q[ft] += 1
                if ft in x['s1_ft']:
                    num_q[ft] += 1 
    
    kld = {}
    for ft in fts:
        p1, q1 = div(num_p[ft],denom_p[ft]), div(num_q[ft],denom_q[ft])
        p0, q0 = 1 - p1, 1 - q1
        kld[ft] = kl([p0,p1],[q0,q1])
    
    return kld
        

In [123]:
%%time
kld = get_kld(dicts, X_train, Y_train)

CPU times: user 3.98 s, sys: 113 ms, total: 4.09 s
Wall time: 4.04 s


In [124]:
print "small kld example: 'the' (%.6f)" % kld['the']
print "large kld example: 'accuse' (%.6f)" % kld['accuse']

small kld example: 'the' (0.004158)
large kld example: 'accuse' (9.821485)


In [125]:
kld[(u'since', 'rel', u'february')]

10

### B. Weight Each Feature by TF-KLD

In [102]:
def tf_kld(dicts, kld, s2f): # in situ weighting.
    
    ft2idx = dicts['ft2idx']
    
    for ft,kl in kld.iteritems():
        s2f[:,ft2idx[ft]] *= kl
     

In [103]:
%%time
tf_kld(dicts, kld, s2f)

CPU times: user 1min 51s, sys: 3min 14s, total: 5min 5s
Wall time: 5min 57s


## IV. NMF Reduction

In [104]:
from sklearn.decomposition import NMF

In [105]:
nmf = NMF(n_components=100,verbose=3) # cf. JE13:894

In [171]:
%%time
s2f_nmf = nmf.fit_transform(s2f_csr) # ~12min for k=100

In [126]:
import cPickle

In [127]:
data_path = "/Users/jacobsw/Desktop/WORK/OJO_CODE/SENTENCE_SIMILARITIES/DATA/"

In [128]:
# SAVE
with open(data_path+'latent_s2f_matrix_k100.p','wb') as f:
    cPickle.dump(s2f_nmf, f)
# with open(data_path+'latent_s2f_matrix_k400.p','wb') as f:
#     cPickle.dump(s2f_nmf, f)
# with open(data_path+'latent_s2f_matrix_k100_kl2.p','wb') as f:
#     cPickle.dump(s2f_nmf, f)
# with open(data_path+'latent_s2f_matrix_k100_kl3.p','wb') as f:
#     cPickle.dump(s2f_nmf, f)

# LOAD
# with open(data_path+'latent_s2f_matrix_k100.p','rb') as f:
#     s2f_nmf = cPickle.load(f)
# with open(data_path+'latent_s2f_matrix_k400.p','rb') as f:
#     s2f_nmf = cPickle.load(f)
# with open(data_path+'latent_s2f_matrix_k100_kl2.p','wb') as f:
#     s2f_nmf = cPickle.load(f)
# with open(data_path+'latent_s2f_matrix_k100_kl3.p','wb') as f:
#     s2f_nmf = cPickle.load(f)

## V. Final Featurization

**Math**

* $s(v_1,v_2) = [v_1+v_2, |v_1-v_2|]$, i.e. concatenation of the element-wise sum $v_1+v_2$ and absolute difference $|v_1-v_2|$.
* **NB**: $v_1$ and $v_2$ are the *latent representations* (i.e. dimension-reduced) of a pair of sentences.

In [129]:
# LOAD W06 FEATURES
with open(data_path+'train1.p','rb') as f_train:
    X_train_w06fts, Y_train_w06fts = cPickle.load(f_train)
with open(data_path+'test1.p','rb') as f_test:
    X_test_w06fts, Y_test_w06fts = cPickle.load(f_test)

In [133]:
def featurize(dicts, X_train, Y_train, X_test, Y_test, s2f_nmf):
    
    id2idx = dicts['id2idx']
    X_train_fts, Y_train_fts = [], []
    X_test_fts, Y_test_fts = [], []
    
    for i,x in X_train.iteritems():
        s1_idx, s2_idx = id2idx[x['s1_id']], id2idx[x['s2_id']]
        v1, v2 = s2f_nmf[s1_idx], s2f_nmf[s2_idx]
        s = np.concatenate([v1+v2,abs(v1-v2),X_train_w06fts[i]])
        X_train_fts.append(s)
        Y_train_fts.append(Y_train[i])
    for i,x in X_test.iteritems():
        s1_idx, s2_idx = id2idx[x['s1_id']], id2idx[x['s2_id']]
        v1, v2 = s2f_nmf[s1_idx], s2f_nmf[s2_idx]
        s = np.concatenate([v1+v2,abs(v1-v2),X_test_w06fts[i]])
        X_test_fts.append(s)
        Y_test_fts.append(Y_test[i])  
    
    return X_train_fts, Y_train_fts, X_test_fts, Y_test_fts
    

In [134]:
%%time
X_train_fts, Y_train_fts, X_test_fts, Y_test_fts = featurize(dicts, X_train, Y_train, X_test, Y_test, s2f_nmf)

CPU times: user 52.8 ms, sys: 9.34 ms, total: 62.1 ms
Wall time: 76.7 ms


In [135]:
len(X_train_fts[0])

222

## VI. Evaluation

In [136]:
def evaluate(X_test_fts, Y_test_fts, model):
    y_true = Y_test_fts
    y_pred = model.predict(X_test_fts)
    print 'Accuracy: %.6f' % accuracy_score(y_true,y_pred)
    print
    print classification_report(y_true,y_pred)

### A. Logistic Regression

##### Results Summary

* $K=100$: $Accuracy = .73; Prec/Rec/F1 = .72/.73/.72$
* $K=400$: $Accuracy = .73; Prec/Rec/F1 = .72/.73/.73$

In [137]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

In [138]:
lr = LogisticRegression()

In [139]:
lr.fit(X_train_fts, Y_train_fts)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [140]:
evaluate(X_test_fts, Y_test_fts, lr) # K=100 results

Accuracy: 0.731594

             precision    recall  f1-score   support

          0       0.63      0.49      0.55       578
          1       0.77      0.85      0.81      1147

avg / total       0.72      0.73      0.72      1725



### B. SVM

##### Results Summary

* **Default Setting**
    * $Linear Kernel$
        * $K=100$: $Accuracy = .73; Prec/Rec/F1 = .72/.73/.72$
        * $K=400$: $Accuracy = ; Prec/Rec/F1 = $
    * $RBF Kernel$
        * $K=100$: $Accuracy = .73; Prec/Rec/F1 = .72/.73/.72$
        * $K=400$: $Accuracy = .74; Prec/Rec/F1 = .73/.74/.73$

##### a. Default Settings

In [29]:
from sklearn import svm

In [147]:
svm_linear = svm.SVC(kernel='linear',verbose=3)
svm_rbf = svm.SVC(kernel='rbf',verbose=3)

In [151]:
%%time
svm_linear.fit(X_train_fts, Y_train_fts)

[LibSVM]CPU times: user 20.6 s, sys: 44.7 ms, total: 20.6 s
Wall time: 20.6 s


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=3)

In [152]:
%%time
svm_rbf.fit(X_train_fts, Y_train_fts)

[LibSVM]CPU times: user 3.08 s, sys: 7.32 ms, total: 3.09 s
Wall time: 3.09 s


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=3)

In [169]:
%%time
svm_linsvc = svm.LinearSVC()
svm_linsvc.fit(X_train_fts, Y_train_fts)

CPU times: user 812 ms, sys: 3.75 ms, total: 816 ms
Wall time: 816 ms


##### b. Grid Search

In [141]:
from sklearn.grid_search import GridSearchCV

In [142]:
params = {
    'C': (.01,.1,.2,.5),
}

##### Grid Search I: Linear Kernel

**K=100**

In [145]:
%%time
grd_lin_k100 = GridSearchCV(svm.SVC(kernel='linear'),params,cv=5,verbose=3,n_jobs=4)
grd_lin_k100.fit(X_train_fts,Y_train_fts)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] C=0.01 ..........................................................
[CV] C=0.01 ..........................................................
[CV] C=0.01 ..........................................................
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.718137 -   2.6s
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.721814 -   2.6s
[CV] C=0.1 ...........................................................
[CV] ................................. C=0.01, score=0.723039 -   2.7s
[CV] C=0.1 ...........................................................
[CV] ................................. C=0.01, score=0.721130 -   2.7s
[CV] C=0.1 ...........................................................
[CV] ................................. C=0.01, score=0.728501 -   2.6s
[CV] C=0.1 ......

[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   15.3s remaining:   -0.7s


[CV] .................................. C=0.2, score=0.735872 -   4.6s
[CV] C=0.5 ...........................................................


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   17.2s remaining:   -0.8s


[CV] .................................. C=0.5, score=0.731618 -   7.8s
[CV] C=0.5 ...........................................................


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   20.7s remaining:   -1.0s


[CV] .................................. C=0.5, score=0.719363 -   7.2s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   21.4s remaining:   -1.0s


[CV] .................................. C=0.5, score=0.721814 -   6.9s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   22.3s remaining:   -1.1s


[CV] .................................. C=0.5, score=0.732187 -   7.5s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   24.7s remaining:   -1.2s


[CV] .................................. C=0.5, score=0.740786 -   6.8s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   27.6s remaining:   -1.3s
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:   27.6s finished


CPU times: user 15.8 s, sys: 1.39 s, total: 17.2 s
Wall time: 37.7 s


In [144]:
print "Best Score: %.6f%%" % (grd_lin_k100.best_score_*100)
print "Best Params: "
best_params = grd_lin_k100.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print "\t%s: %r" % (param_name, best_params[param_name])

Best Score: 72.914622%
Best Params: 
	C: 0.5


**K=400**

In [30]:
%%time
grd_lin_k400 = GridSearchCV(svm.SVC(kernel='linear'),params,cv=5,verbose=3,n_jobs=4)
grd_lin_k400.fit(X_train_fts,Y_train_fts)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] C=0.01 ..........................................................
[CV] C=0.01 ..........................................................
[CV] C=0.01 ..........................................................
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.720588 -   9.2s
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.721814 -   9.2s
[CV] C=0.1 ...........................................................
[CV] ................................. C=0.01, score=0.725490 -   9.4s
[CV] C=0.1 ...........................................................
[CV] ................................. C=0.01, score=0.723587 -   9.5s
[CV] C=0.1 ...........................................................
[CV] ................................. C=0.01, score=0.724816 -   9.9s
[CV] C=0.1 ......

[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   47.8s remaining:   -2.3s


[CV] .................................. C=0.2, score=0.729730 -  13.4s
[CV] C=0.5 ...........................................................


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   50.5s remaining:   -2.4s


[CV] .................................. C=0.5, score=0.730392 -  22.0s
[CV] C=0.5 ...........................................................


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   59.6s remaining:   -2.8s


[CV] .................................. C=0.5, score=0.715686 -  21.7s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:  1.1min remaining:   -3.2s


[CV] .................................. C=0.5, score=0.740786 -  18.2s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:  1.1min remaining:   -3.3s


[CV] .................................. C=0.5, score=0.734069 -  21.4s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:  1.2min remaining:   -3.3s


[CV] .................................. C=0.5, score=0.730958 -  17.9s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:  1.3min remaining:   -3.7s
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:  1.3min finished


CPU times: user 21.7 s, sys: 2.11 s, total: 23.8 s
Wall time: 1min 32s


In [31]:
print "Best Score: %.6f%%" % (grd_lin_k400.best_score_*100)
print "Best Params: "
best_params = grd_lin_k400.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print "\t%s: %r" % (param_name, best_params[param_name])

Best Score: 73.110893%
Best Params: 
	C: 0.1


##### Grid Search II: RBF Kernel 

**K=100**

In [82]:
%%time
grd_rbf_k100 = GridSearchCV(svm.SVC(kernel='rbf'),params,cv=5,verbose=3,n_jobs=4)
grd_rbf_k100.fit(X_train_fts,Y_train_fts)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] C=0.01 ..........................................................
[CV] C=0.01 ..........................................................
[CV] C=0.01 ..........................................................
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.675245 -   3.0s
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.675245 -   3.1s
[CV] C=0.1 ...........................................................
[CV] ................................. C=0.01, score=0.675245 -   3.2s
[CV] C=0.1 ...........................................................
[CV] ................................. C=0.01, score=0.675676 -   3.2s
[CV] C=0.1 ...........................................................
[CV] ................................. C=0.01, score=0.675676 -   3.4s
[CV] C=0.1 ......

[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   14.6s remaining:   -0.7s


[CV] .................................. C=0.2, score=0.702703 -   3.5s
[CV] C=0.5 ...........................................................


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   15.0s remaining:   -0.7s


[CV] .................................. C=0.5, score=0.726716 -   3.5s
[CV] C=0.5 ...........................................................


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   15.4s remaining:   -0.7s


[CV] .................................. C=0.5, score=0.689951 -   3.5s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   17.7s remaining:   -0.8s


[CV] .................................. C=0.5, score=0.730392 -   3.5s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   18.1s remaining:   -0.9s


[CV] .................................. C=0.5, score=0.737101 -   3.4s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   18.5s remaining:   -0.9s


[CV] .................................. C=0.5, score=0.723587 -   3.3s


[Parallel(n_jobs=4)]: Done  21 out of  20 | elapsed:   18.8s remaining:   -0.9s
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:   18.8s finished


CPU times: user 10.4 s, sys: 1.39 s, total: 11.8 s
Wall time: 22.9 s


In [83]:
print "Best Score: %.6f%%" % (grd_rbf_k100.best_score_*100)
print "Best Params: "
best_params = grd_rbf_k100.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print "\t%s: %r" % (param_name, best_params[param_name])

Best Score: 72.154073%
Best Params: 
	C: 0.5


**K=400**

In [73]:
%%time
grd_rbf_k400 = GridSearchCV(svm.SVC(kernel='rbf'),params,cv=5,verbose=3,n_jobs=4)
grd_rbf_k400.fit(X_train_fts,Y_train_fts)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.675245 -   9.1s
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.686275 -   9.4s
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.675245 -   9.2s
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.675676 -   9.1s
[CV] C=0.01 ..........................................................
[CV] ................................. C=0.01, score=0.679361 -   9.0s
[CV] C=0.1 ...........................................................
[CV] .................................. C=0.1, score=0.740196 -   8.8s
[CV] C=0.1 ...........................................................
[CV] ............

[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  3.0min finished


CPU times: user 3min 6s, sys: 1.1 s, total: 3min 7s
Wall time: 3min 8s


In [74]:
print "Best Score: %.6f%%" % (grd_rbf_k400.best_score_*100)
print "Best Params: "
best_params = grd_rbf_k400.best_estimator_.get_params()
for param_name in sorted(params.keys()):
    print "\t%s: %r" % (param_name, best_params[param_name])

Best Score: 72.350343%
Best Params: 
	C: 0.1


##### Grid Search III: LinearSVC

In [35]:
from sklearn.svm import LinearSVC

In [None]:
%%time
grd_linsvc_k400 = GridSearchCV(LinearSVC(kernel='rbf'),params,cv=5,verbose=3,n_jobs=4)
grd_linsvc_k400.fit(X_train_fts,Y_train_fts)

##### Evaluation Block

In [154]:
evaluate(X_test_fts, Y_test_fts, svm_linear) # K=100 results

Accuracy: 0.731594

             precision    recall  f1-score   support

          0       0.65      0.44      0.52       578
          1       0.76      0.88      0.81      1147

avg / total       0.72      0.73      0.72      1725



In [155]:
evaluate(X_test_fts, Y_test_fts, svm_rbf) # K=100 results

Accuracy: 0.740290

             precision    recall  f1-score   support

          0       0.65      0.48      0.56       578
          1       0.77      0.87      0.82      1147

avg / total       0.73      0.74      0.73      1725



In [170]:
evaluate(X_test_fts, Y_test_fts, svm_linsvc) # K=100 results

Accuracy: 0.700870

             precision    recall  f1-score   support

          0       0.89      0.12      0.22       578
          1       0.69      0.99      0.82      1147

avg / total       0.76      0.70      0.61      1725



### C. Customized Paraphrase Searching

In [None]:
# how to project the featurized new sentence into the space of the trained?