# Testing a fiction filter

This notebook loads some training data and trains a regularized logistic regression on it, using a grid search to find a good number of features and regularization constant. We optimize on F0.5 score, a weighted harmonic mean of precision and recall that puts more emphasis on precision. In practice, I don't think this produces results hugely different from F1 score.
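
For reference, the usual definition of the F-beta score is written out below. The symbols $P$ (precision), $R$ (recall), and $\beta$ are notation chosen here, not names from the notebook's code.

$$
F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}
$$

Setting $\beta = 1$ gives the familiar $F_1 = 2PR / (P + R)$, while $\beta = 0.5$ gives $F_{0.5} = 1.25\,PR / (0.25\,P + R)$, which weights precision more heavily than recall.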

There are more sophisticated feature-selection strategies than simply using the *n* most common features, but if I used them I would also need a more sophisticated validation strategy to avoid fooling myself, e.g. a validation set separate from the test set. Without extensive resources for generating labeled data, I'm trying to keep things relatively quick and simple.
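
To make the "validation set separate from the test set" idea concrete, here is a minimal sketch of a three-way split. It is not something this notebook does; the synthetic `X` and `y`, the 60/20/20 proportions, and the random seeds are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and labels (an assumption).
rng = np.random.default_rng(0)
X = rng.random((478, 20))
y = rng.integers(0, 2, size=478)

# First carve off a test set that is never touched during model selection.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(X_train.shape[0], X_val.shape[0], X_test.shape[0])
```

Feature selection and hyperparameters would then be tuned against the validation set only, with the test set scored once at the end.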

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
import pickle



In [2]:
rawdata = pd.read_csv('trainingdata.tsv', sep = '\t')
rawdata.head()

Unnamed: 0,sequenceID,genrecode,#matchquality,#rareword,of,the,and,a,is,to,...,title,air,four,neither,latter,beauty,lord,y,u,number
0,573,y,2.502942,0.311475,0.04918,0.042623,0.02623,0.039344,0.02623,0.016393,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,208-2,y,3.153019,0.281944,0.0375,0.080556,0.030556,0.033333,0.025,0.027778,...,0.0,0.0,0.001389,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,315-3,y,2.3566,0.37619,0.047619,0.061905,0.028571,0.02381,0.014286,0.009524,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,202,y,2.834483,0.325893,0.0625,0.075893,0.040179,0.03125,0.013393,0.017857,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,460,y,2.256718,0.307233,0.036461,0.050209,0.028691,0.025702,0.019127,0.022116,...,0.0,0.000598,0.0,0.0,0.000598,0.000598,0.0,0.0,0.000598,0.0


This is the complete dataset as it exists on disk. We may not use all the features. Notice that the most common feature is '#rareword', i.e., an English word outside this limited vocabulary.

We're going to select only the columns for features, leaving out the first two metadata columns.

In [3]:
termdoc = rawdata.iloc[ : , 2 : 402]
termdoc.shape

(478, 400)

Let's create a vector that maps genrecode (is it fiction n/y) to an integer code.

In [4]:
classvec = rawdata.genrecode.map({'n' : 0, 'y': 1})
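
One thing worth checking after a mapping like this is whether any label fell outside the dictionary, because `Series.map` fills those positions with `NaN` rather than raising an error. A small sketch, using a hypothetical label column with one deliberately bad code:

```python
import pandas as pd

# Hypothetical label column; the last entry is a deliberately bad code.
genrecode = pd.Series(['y', 'n', 'y', 'maybe'])

classvec = genrecode.map({'n': 0, 'y': 1})

# Series.map leaves NaN wherever a value is missing from the dict.
unmapped = genrecode[classvec.isna()]
print(unmapped.tolist())
```

If the list printed is empty, every row received a class code.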

Now the function that actually trains a logistic model.

In [5]:
def onepass(cval, numfeatures, termdoc, classvec):
    '''
    cval is the regularization constant
    numfeatures is the number of features to use in the model
    termdoc is X, aka the feature matrix
    classvec is y, aka the vector of class integers to be predicted
    '''
    
    scaler = StandardScaler()
    data = scaler.fit_transform(termdoc.iloc[ : , 0 : numfeatures])
    
    # Note that we scale and center the columns of the feature matrix.
    
    model = LogisticRegression(C = cval)
    f1_scores = cross_val_score(model, data, classvec,
                             scoring = 'f1', cv=10)
    f1 = sum(f1_scores) / len(f1_scores)
    # Tenfold crossvalidation, using F1 score.
    
    precision_scores = cross_val_score(model, data, classvec,
                             scoring = 'precision', cv=10)
    precision = sum(precision_scores) / len(precision_scores)
    
    recall_scores = cross_val_score(model, data, classvec,
                             scoring = 'recall', cv=10)
    recall = sum(recall_scores) / len(recall_scores)
    
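    # Note: a strict F0.5 uses beta**2 = 0.25, i.e.
    # 1.25 * P * R / (0.25 * P + R). The expression below puts 0.5
    # in that position, so it still favors precision, but less
    # strongly. It is left as-is so the printed scores stay reproducible.
    f05 = 1.5 * (precision * recall) / ((.5 * precision) + recall)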
    
    model.fit(data, classvec)
    predictions = model.predict(data)
    
    # We return the precision-weighted F score, precision, and
    # recall of a cross-validated model, plus the predictions
    # of a model trained on all the data.
    
    return f05, precision, recall, predictions
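
The function above scales the whole matrix before cross-validating and runs a separate cross-validation for each metric. A possible variant, sketched here on synthetic data rather than the real `termdoc`, wraps the scaler and the model in a pipeline so the scaler is refit inside each fold, and collects precision and recall in one pass with `cross_validate`. The data, the feature count, and `C=0.04` are assumptions for illustration; this is not the code that produced the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for termdoc / classvec (an assumption for illustration).
X, y = make_classification(n_samples=478, n_features=110, random_state=0)

# The pipeline refits the scaler on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(C=0.04))

# One cross-validation run yields precision and recall together.
results = cross_validate(pipeline, X, y, cv=10,
                         scoring=['precision', 'recall'])

precision = results['test_precision'].mean()
recall = results['test_recall'].mean()
print(precision, recall)
```

With only a few hundred rows the difference from scaling up front is likely small, but the pipeline form keeps held-out rows from influencing the scaling statistics.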

Grid search across a range of feature counts and regularization constants. Notice that for regularization, we iterate across an integer range but then divide by ten thousand, so a value of 400 is actually .04.

In [12]:
bestscores = []
for features in range(90, 120, 5):
    for cval in range(250, 500, 10):
        f05, precision, recall, predictions = onepass(cval/ 10000, features, termdoc, classvec)
        print(f05, cval, features)
        bestscores.append((f05, cval, features))
        

0.879913433764 250 90
0.879913433764 260 90
0.879913433764 270 90
0.882781368353 280 90
0.882781368353 290 90
0.883959573752 300 90
0.88193014606 310 90
0.885497356017 320 90
0.887303470197 330 90
0.885725046533 340 90
0.885725046533 350 90
0.885725046533 360 90
0.887022044355 370 90
0.887022044355 380 90
0.887022044355 390 90
0.883897694666 400 90
0.883897694666 410 90
0.883897694666 420 90
0.882266052414 430 90
0.882266052414 440 90
0.883412854045 450 90
0.883412854045 460 90
0.885064673086 470 90
0.885064673086 480 90
0.885064673086 490 90
0.88479005475 250 95
0.886584724374 260 95
0.886584724374 270 95
0.886584724374 280 95
0.886584724374 290 95
0.886584724374 300 95
0.883222033492 310 95
0.883222033492 320 95
0.883222033492 330 95
0.884444179709 340 95
0.884444179709 350 95
0.883200451949 360 95
0.883200451949 370 95
0.887531920866 380 95
0.887531920866 390 95
0.887531920866 400 95
0.888752515861 410 95
0.888752515861 420 95
0.888752515861 430 95
0.888752515861 440 95
0.8887525158

In [13]:
bestscores.sort()
bestscores[-1]

(0.90616806025370267, 400, 110)
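
The manual loop above could also be written with scikit-learn's `GridSearchCV`. The sketch below is an alternative under stated assumptions, not the method used here: it runs on synthetic data, uses a coarser grid of `C` values to stay quick, and scores with the standard `fbeta_score` at `beta=0.5`, which is computed per fold and then averaged. That differs from `onepass`, which averages precision and recall first and combines them afterwards, so the numbers would not match exactly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (an assumption for illustration).
X, y = make_classification(n_samples=478, n_features=120, random_state=0)

# Scorer for the standard F-beta with beta = 0.5.
f05_scorer = make_scorer(fbeta_score, beta=0.5)

# Coarser grid than the loop above, to keep the sketch quick.
grid = {'logisticregression__C': [c / 10000 for c in range(250, 500, 50)]}

best = None
for numfeatures in range(90, 120, 5):
    pipeline = make_pipeline(StandardScaler(), LogisticRegression())
    search = GridSearchCV(pipeline, grid, scoring=f05_scorer, cv=10)
    # Use the first numfeatures columns, as onepass does.
    search.fit(X[:, :numfeatures], y)
    candidate = (search.best_score_,
                 search.best_params_['logisticregression__C'],
                 numfeatures)
    if best is None or candidate > best:
        best = candidate

print(best)
```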

Test the best model more specifically. Note that precision is good, which is important.

**Recall 87.2, precision 92.4.**

In [14]:
f05, precision, recall, predictions = onepass(.04, 110, termdoc, classvec)
print(f05, precision, recall, sum(predictions))

0.906168060254 0.924118143347 0.872281639929 332
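
As a quick sanity check on the score, the snippet below plugs the precision and recall printed above into three formulas: the expression used in `onepass`, the standard F-beta at `beta = 0.5`, and F1. The helper `fbeta` is a hypothetical name introduced only for this sketch.

```python
def fbeta(precision, recall, beta):
    """Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Precision and recall as printed in the output above.
precision, recall = 0.924118143347, 0.872281639929

# The expression used inside onepass.
notebook_score = 1.5 * (precision * recall) / ((.5 * precision) + recall)
strict_f05 = fbeta(precision, recall, 0.5)
f1 = fbeta(precision, recall, 1.0)

print(round(notebook_score, 4), round(strict_f05, 4), round(f1, 4))
```

The `onepass` expression reproduces the 0.906 reported above and sits between F1 and the strict F0.5 (about 0.913), so it leans toward precision, just a little less than a strict F0.5 would.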


### Produce a model to export

Ultimately we have to produce a model to use, and since it is trained on all the data, it can't itself be crossvalidated.

In [15]:
scaler = StandardScaler()
data = scaler.fit_transform(termdoc.iloc[ : , 0 : 110])
model = LogisticRegression(C = .04)
model.fit(data, classvec)
print('Model trained.')

Model trained.


In [16]:
with open('model/fictionreview_scaler.pkl', mode = 'wb') as f:
    pickle.dump(scaler, f)

with open('model/fictionreview_model.pkl', mode = 'wb') as f:
    pickle.dump(model, f)

words = list(termdoc.columns)
with open('model/fictionreview_vocab.txt', mode = 'w', encoding = 'utf-8') as f:
    for w in words[0: 400]:
        f.write(w + '\n')
    

The cell above is where we actually export the scaler, the model, and the vocabulary.
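
For completeness, here is a sketch of the round trip: pickling a scaler and model, loading them back, and applying them to new rows. It uses synthetic data and a temporary directory rather than the `model/` folder, and the file names inside it are hypothetical. Note that the vocabulary file written above lists 400 words while the model was trained on the first 110 columns, so whatever consumes these files would presumably need to restrict itself to the first 110 features.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.random((200, 110))
y = (X[:, 0] > 0.5).astype(int)

scaler = StandardScaler()
model = LogisticRegression(C=.04).fit(scaler.fit_transform(X), y)

# Write to a temporary directory rather than the notebook's model/ folder.
with tempfile.TemporaryDirectory() as tmpdir:
    scaler_path = os.path.join(tmpdir, 'scaler.pkl')
    model_path = os.path.join(tmpdir, 'model.pkl')
    with open(scaler_path, mode='wb') as f:
        pickle.dump(scaler, f)
    with open(model_path, mode='wb') as f:
        pickle.dump(model, f)

    with open(scaler_path, mode='rb') as f:
        loaded_scaler = pickle.load(f)
    with open(model_path, mode='rb') as f:
        loaded_model = pickle.load(f)

# New rows must have the same 110 columns, in the same order, and
# must pass through the fitted scaler (transform, not fit_transform).
new_rows = rng.random((5, 110))
predictions = loaded_model.predict(loaded_scaler.transform(new_rows))
print(predictions)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, and only from a trusted source.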

#### Examine feature weights.

In [17]:
words = list(termdoc.columns)

In [18]:
features = list(zip(list(model.coef_[0]), words[0: 110]))
features.sort()
for x, y in features:
    print(y, x)

#placename -0.33072670946
voyages -0.318318190897
letters -0.316025820311
which -0.291357045877
on -0.243732293916
was -0.230331603847
of -0.175099206423
book -0.142205727355
#rareword -0.141257546259
own -0.134639885455
as -0.133043067259
it -0.127656254876
when -0.12318173239
we -0.115418103708
no -0.109214407898
from -0.10667403415
would -0.10652392581
can -0.105381511701
#notenglishword -0.0998335574906
been -0.0939636197274
even -0.0938568671482
those -0.0876184890147
be -0.0874594975767
this -0.0744574674843
time -0.0719073164952
most -0.0697927930797
well -0.0686736966787
have -0.0678855792873
i -0.0672386863886
after -0.0662116972901
#romannumeral -0.0617962840391
and -0.0611271990167
mr -0.0576872796678
such -0.0567907944629
first -0.0566274716723
he -0.0521599916035
they -0.0517693306687
these -0.0505014664264
to -0.0415642129724
or -0.0385367708727
than -0.0379260092442
work -0.036809250317
that -0.0312364837234
if -0.0304267378171
may -0.0287708380732
us -0.020659046931
its