## Classifying Text . . . 

Using the NLTK Naive Bayes classifier (by way of TextBlob), which is slow, but [simple and well-documented](http://textblob.readthedocs.io/en/dev/classifiers.html).  The [NLTK documentation](http://www.nltk.org/book/ch06.html) is also useful.

*The Programming Historian* has [a useful survey of clustering and classification](https://programminghistorian.org/lessons/naive-bayesian#machine-learning).

But, unfortunately, the NLTK/TextBlob Naive Bayes classifier is slow.  Very, very slow.  

*GeeksForGeeks* has [an overview of sklearn classification](https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/), which includes code snippets.  We'll use [an sklearn Naive Bayes classifier](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB.predict), since it's so fast.

Hoyt Long and Richard So, ["Literary Pattern Recognition: Modernism between Close Reading and Machine Learning"](https://lucian.uchicago.edu/blogs/literarynetworks/files/2015/12/LONG_SO_CI.pdf), *Critical Inquiry*, Vol. 42, No. 2 (Winter 2016), pp. 235-267.  And [a 20-minute video discussing the article](https://criticalinquiry.uchicago.edu/hoyt_long_and_richard_so_on_close_reading_machine_learning_and_patterns_in/).  **Done with Naive Bayes.**

## . . . Using Data from the Muncie Public Library

[What Middletown Read](http://www.bsu.edu/libraries/wmr/).  175,000 transactions; ~ 1890 to 1900; ~6,000 readers with census data attached (age, gender); plain text for 60% of the transactions.

[Gendered reading in the Muncie Public Library](https://talus.artsci.wustl.edu/ageGenderCharts/examples/muncieAuthors.html)

**Very gendered reading by boys and girls.** Girls tended to read one set of books, boys another; i.e., for a lot of books, the readership was ~70% boys or ~70% girls.

Can we distinguish content differences between "boy" books and "girl" books?  How much of an exception are books by Horatio Alger?  Are best sellers more like "boy" books or "girl" books?

## Why Muncie here?  "Labeled Data"

Because classification requires [**labeled data**](https://en.wikipedia.org/wiki/Labeled_data) for training and testing.  We can use a trained and tested classifier to predict the label for unlabeled data; however, at least some of our data needs to be labeled before we can start.


In [18]:
!ls -ld corpora/muncie/*
!echo ''
!ls -1 corpora/muncie/boys/* | wc -l
!ls -1 corpora/muncie/girls/* | wc -l
!ls -1 corpora/muncie/alger/* | wc -l
!ls -1 corpora/muncie/best_sellers/* | wc -l
!ls -1 corpora/muncie/alger_boys/* | wc -l

drwxrwxr-x 2 spenteco spenteco  4096 Jun  7  2017 corpora/muncie/alger
drwxrwxr-x 2 spenteco spenteco 20480 Apr 24 15:54 corpora/muncie/alger_boys
drwxrwxr-x 2 spenteco spenteco 16384 Apr 24 09:11 corpora/muncie/best_sellers
drwxrwxr-x 2 spenteco spenteco 16384 Apr 24 09:12 corpora/muncie/boys
drwxrwxr-x 2 spenteco spenteco 12288 Apr 24 09:13 corpora/muncie/girls

69
88
48
105
116


### Routines to load corpora and texts

Notice that I'm not doing any complicated NLP processing; I'm just doing regex tokenization, and dropping stopwords.

The "class Text" is probably an unnecessary elaboration; however, when I started this notebook, I wasn't sure how much complexity I was going to need, so I wanted to come up with some way to "black box" that complexity.
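To make the "regex tokenization, and dropping stopwords" step concrete, here is a minimal sketch of the same idea on one sentence. It is written for Python 3 (the notebook's own cells are Python 2), and `DEMO_STOPWORDS` and `simple_tokens` are hypothetical names with a made-up, tiny stopword list; the notebook itself uses NLTK's English list.

```python
import re

# Hypothetical, tiny stopword list for illustration only; the notebook uses
# NLTK's English list via stopwords.words('english').
DEMO_STOPWORDS = {'the', 'a', 'and', 'of', 'to', 'he', 'his', 'was', 'in'}

def simple_tokens(raw_text, stopwords=DEMO_STOPWORDS):
    """Lowercase, split on anything that is not a-z, drop empties and stopwords."""
    collapsed = re.sub(r'\s+', ' ', raw_text)
    return [t for t in re.split(r'[^a-z]+', collapsed.lower())
            if t and t not in stopwords]

demo_tokens = simple_tokens("Frank threw himself on his hands and knees,\n and didn't stop.")
print(demo_tokens)
```

Note that splitting on non-letters breaks "didn't" into `didn` and `t`.  Single-letter features that turn up later in the notebook (such as `u` and `f`) are probably fragments of this kind, though I have not checked.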

In [2]:
import glob, codecs, re, random, time
from nltk.corpus import stopwords
        
sw = set(stopwords.words('english'))

class Text():
    
    def __init__(self, parm_path_to_file):
    
        self.author_title = parm_path_to_file.split('/')[-1].replace('.txt', '')
        
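        # Read the file inside a "with" block so the file handle gets closed;
        # the raw string keeps the regex backslash from being read as an escape.
        with codecs.open(parm_path_to_file, 'r', encoding='utf-8') as f:
            self.raw_text = re.sub(r'\s+', ' ', f.read())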
        
        self.tokens = []
        for t in re.split('[^a-z]', self.raw_text.lower()):
            if t > '' and t not in sw:
                self.tokens.append(t)
        
    def get_random_slice(self, parm_slice_length):
        
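        # Guard against texts shorter than the requested slice: without max(),
        # random.randint would raise ValueError on a negative upper bound.
        last_possible_starting_position = max(0, len(self.tokens) - parm_slice_length - 1)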
        
        starting_position = random.randint(0, last_possible_starting_position)
        ending_position = starting_position + parm_slice_length
        
        token_slice = self.tokens[starting_position: ending_position]
        
        return token_slice
        
#  --------------------------------------------------------------------------------

start_time = time.time()

subcorpora_folders = ['alger', 'best_sellers', 'boys', 'girls', 'alger_boys']

my_corpora = {}

for folder in subcorpora_folders:
    
    print 'loading folder', folder
    
    my_corpora[folder] = []   
        
    for path_to_file in glob.glob('corpora/muncie/' + folder + '/*.txt'):
        
        my_corpora[folder].append(Text(path_to_file))

print
for k, v in my_corpora.iteritems():
    print k, len(v)
    
print
print my_corpora['boys'][0].get_random_slice(200)

stop_time = time.time()
    
print
print 'Done!', (stop_time - start_time)

loading folder alger
loading folder best_sellers
loading folder boys
loading folder girls
loading folder alger_boys

boys 69
best_sellers 105
girls 88
alger_boys 116
alger 48

[u'away', u'camp', u'possible', u'escape', u'became', u'discovered', u'frank', u'without', u'waiting', u'receive', u'congratulations', u'mate', u'looked', u'upon', u'escape', u'certain', u'thing', u'threw', u'hands', u'knees', u'moved', u'slowly', u'across', u'field', u'extended', u'mile', u'back', u'cabin', u'must', u'crossed', u'could', u'reach', u'woods', u'progress', u'slow', u'laborious', u'two', u'hours', u'reached', u'road', u'ran', u'direction', u'supposed', u'river', u'lie', u'seen', u'pickets', u'feeling', u'quite', u'certain', u'outside', u'lines', u'arose', u'feet', u'commenced', u'running', u'top', u'speed', u'road', u'ran', u'thick', u'woods', u'difficulty', u'following', u'moon', u'shining', u'brightly', u'daylight', u'arrived', u'mississippi', u'pleasant', u'sight', u'eyes', u'uttered', u'shout', 

## Basic use of the classifiers

### Pull a set of data to feed into the classifiers

NLTK and sklearn want different inputs, hence the different "samples" lists.  NLTK wants a list of tuples, one tuple per text; each tuple has two parts: the text as a string, and the label as a string.  sklearn wants a dense matrix, so we're going to prepare "samples" suitable for passing into gensim, etc.; these gensim->sklearn "samples" are a list of lists of string tokens.
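A small sketch of the two shapes, using two made-up token slices (all `demo_*` names and the tokens are hypothetical, and the code is written for Python 3):

```python
# Two hypothetical, already-tokenized slices with their folder labels.
demo_slices = [
    (['camp', 'guns', 'river'], 'boys'),
    (['papa', 'lovely', 'flowers'], 'girls'),
]

# TextBlob/NLTK style: one (text_as_string, label) tuple per sample.
demo_samples_nltk = [(' '.join(tokens), label) for tokens, label in demo_slices]

# gensim -> sklearn style: token lists and labels kept in two parallel lists.
demo_samples_sklearn = [tokens for tokens, label in demo_slices]
demo_labels = [label for tokens, label in demo_slices]

print(demo_samples_nltk)
print(demo_samples_sklearn)
print(demo_labels)
```

The parallel lists only work if they stay in the same order, which is why the cell that follows appends to all of them inside one loop.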

In [24]:
training_sources = []
training_samples_nltk = []
training_samples_sklearn = []
training_labels = []

testing_sources = []
testing_samples_nltk = []
testing_samples_sklearn = []
testing_labels = []

for folder in ['boys', 'girls']:
    for text in my_corpora[folder]:
        
        training_slice = text.get_random_slice(1000)
    
        training_sources.append(text.author_title)
        training_samples_nltk.append((' '.join(training_slice) , folder))
        training_samples_sklearn.append(training_slice)
        training_labels.append(folder)
        
        testing_slice = text.get_random_slice(1000)
    
        testing_sources.append(text.author_title)
        testing_samples_nltk.append((' '.join(testing_slice) , folder))
        testing_samples_sklearn.append(testing_slice)
        testing_labels.append(folder)
        
#print
#print 'training_sources', training_sources[:10]
#print
#print 'training_samples_nltk', training_samples_nltk[:10]
#print
#print 'training_samples_sklearn', training_samples_sklearn[:10]
#print
#print 'training_labels', training_labels[:10]

print
print 'Done!'


Done!


### Train and test a Textblob Naive Bayes classifier

In [25]:
import time, random
from textblob.classifiers import NaiveBayesClassifier

# -------------------------------------------------------------------------------
# TRAIN BY PASSING IN ONE SET OF LABELED DATA; THIS RESULTS IN "cl", WHICH IS A
# MODEL WHICH RELATES WORDS IN THE SAMPLES TO THE LABELS IN THE SAMPLES.
# -------------------------------------------------------------------------------

start_time = time.time()
        
cl = NaiveBayesClassifier(training_samples_nltk)

stop_time = time.time()

print 'classifier training', (stop_time - start_time)

# -------------------------------------------------------------------------------
# TAKE ANOTHER SET OF LABELED DATA, AND TEST TO SEE IF THE CLASSIFIER GETS THE
# RIGHT ANSWER (I.E., DO THE WORDS IN THE TESTING SAMPLES LEAD TO THE LABELS IN
# THE TESTING SAMPLES?)
# -------------------------------------------------------------------------------

start_time = time.time()

accuracy = cl.accuracy(testing_samples_nltk)

print 'accuracy', accuracy

stop_time = time.time()

# NOTE: THIS SECOND TIMING COVERS TESTING (THE accuracy CALL), DESPITE THE LABEL.
print 'classifier training', (stop_time - start_time)

# -------------------------------------------------------------------------------
# WHAT WORDS ARE MOST USEFUL IN DIFFERENTIATING BETWEEN THE LABELS?
# -------------------------------------------------------------------------------

print
print cl.show_informative_features(25)

classifier training 145.463336945
accuracy 0.968152866242
classifier training 196.911050797

Most Informative Features
          contains(guns) = True             boys : girls  =     19.1 : 1.0
          contains(camp) = True             boys : girls  =     14.0 : 1.0
        contains(lovely) = True            girls : boys   =     13.9 : 1.0
          contains(papa) = True            girls : boys   =     13.9 : 1.0
       contains(advance) = True             boys : girls  =     13.1 : 1.0
       contains(flowers) = True            girls : boys   =     12.3 : 1.0
       contains(capture) = True             boys : girls  =     12.3 : 1.0
          contains(game) = True             boys : girls  =     11.4 : 1.0
         contains(boats) = True             boys : girls  =     11.4 : 1.0
         contains(fired) = True             boys : girls  =     11.4 : 1.0
         contains(avoid) = True             boys : girls  =     11.4 : 1.0
         contains(rifle) = True             boys : girls

### sklearn does not provide an easy way . . . 

. . . to get at the most informative features.  This, if not quite right, is a reasonable substitute.

NLTK provides the word-label probabilities in a decimal format, which is easy to understand.  sklearn provides the log probabilities, which I find much harder to understand.  Doug Knox, my colleague in the HDW, suggested this bit of math to convert log probabilities to NLTK-like decimal-format probabilities.  The conversion doesn't reproduce the NLTK results, so **more work is needed here**.  But this seems provisionally workable . . . 
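The arithmetic behind the conversion, sketched on made-up numbers (the two probabilities are hypothetical stand-ins for entries of `classifier.feature_log_prob_`; Python 3).  Exponentiating a log probability gives back the probability, and the ratio of two probabilities can be computed either before or after leaving log space:

```python
import math

# Hypothetical log probabilities for one word in two classes, standing in for
# classifier.feature_log_prob_[0][a] and classifier.feature_log_prob_[1][a].
log_p_boys = math.log(0.30)
log_p_girls = math.log(0.02)

# Route 1: exponentiate each log probability, then divide.
ratio_from_probs = math.exp(log_p_boys) / math.exp(log_p_girls)

# Route 2: subtract the logs, then exponentiate once.
ratio_from_logs = math.exp(log_p_boys - log_p_girls)

print('%.1f : 1.0' % ratio_from_probs)
print('%.1f : 1.0' % ratio_from_logs)
```

Both routes give the same ratio; the second avoids dividing by a very small number.  One possible reason the ratios differ from NLTK's (an assumption, not something verified in this notebook) is that the two libraries smooth their counts differently.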

In [5]:
import math

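def print_most_informative_sklearn(classifier, n_to_list):

    # NOTE: relies on the global gensim "dictionary" (built in the classifier
    # cell) to map column numbers back to words.  Class 0 is 'boys' and class 1
    # is 'girls' because sklearn sorts the labels alphabetically.

    results = []

    for a in range(0, len(classifier.feature_log_prob_[0])):

        # feature_log_prob_ holds log probabilities; e**x turns each one
        # back into an ordinary probability.
        class_0 = (math.e**classifier.feature_log_prob_[0][a])
        class_1 = (math.e**classifier.feature_log_prob_[1][a])

        # Always report the larger probability over the smaller one.
        if class_0 > class_1:
            results.append([(class_0 / class_1), 
                            [dictionary[a].ljust(20), 'boy : girl', '\t', 
                                 '%.1f' % (class_0 / class_1), ':', str(1.0)]])
        else:
            results.append([(class_1 / class_0), 
                            [dictionary[a].ljust(20), 'girl : boy', '\t', 
                                 '%.1f' % (class_1 / class_0), ':', str(1.0)]])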

    results.sort(reverse=True)

    for r in results[:n_to_list]:
        print ' '.join(r[1])

## Run sklearn Naive Bayes classifier

Note that we first go through gensim to get a dense corpus ("training_matrix" and "testing_matrix").

Lines like this:

    training_matrix = training_matrix.T

are [some unexplained magic](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.T.html).  Unexplained because we didn't dive into numpy.  **Bottom line**: the array from gensim.matutils.corpus2dense (which is a term-document matrix) needs to be "turned" so it's a document-term matrix.
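The "turn" itself is easy to see on a toy matrix.  This sketch uses plain Python lists rather than numpy so that it has no dependencies; numpy's `.T` does the equivalent row/column swap on arrays.  The counts are made up.

```python
# A hypothetical term-document matrix: one row per term, one column per document.
term_document = [
    [2, 0, 1],   # counts of term 0 in documents 0, 1, 2
    [0, 3, 0],   # counts of term 1 in documents 0, 1, 2
]

# "Turning" the matrix swaps rows and columns, giving one row per document.
document_term = [list(column) for column in zip(*term_document)]

print(len(term_document), len(term_document[0]))    # terms x documents
print(len(document_term), len(document_term[0]))    # documents x terms
print(document_term)
```

sklearn expects one row per sample, which is why the document-term orientation is the one passed to `fit` and `score`.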

Note how much faster this is (<0.1 second) than NLTK/TextBlob (5 or 6 minutes).  Note also that the results, while comparable to NLTK/TextBlob, are not exactly the same.

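Also, I'm using the BernoulliNB classifier, and not the GaussianNB classifier.  Why?  The BernoulliNB provides the data necessary to list the most informative features; the GaussianNB does not.  What's the difference between GaussianNB and BernoulliNB?  As far as I can tell, it's a question of the feature distribution each one assumes: GaussianNB treats each feature as a continuous, normally distributed value, while BernoulliNB treats each feature as a binary present/absent flag (by default it binarizes the counts, so any count above zero becomes "present") . . . 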
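To show what a Bernoulli-style model estimates, here is a from-scratch sketch on made-up counts.  It is not sklearn's code: `bernoulli_log_probs` and the `demo_*` data are hypothetical, and the add-alpha smoothing is my understanding of what BernoulliNB does with its default `alpha=1.0`.  Written for Python 3.

```python
import math

# Hypothetical document-term counts (rows are documents, columns are words).
demo_vocabulary = ['camp', 'papa', 'river']
demo_counts = [[3, 0, 1], [1, 0, 0], [0, 2, 1], [0, 1, 0]]
demo_labels = ['boys', 'boys', 'girls', 'girls']

def bernoulli_log_probs(counts, labels, alpha=1.0):
    """Per class, log P(word present | class) with add-alpha smoothing."""
    result = {}
    for label in sorted(set(labels)):
        rows = [row for row, lab in zip(counts, labels) if lab == label]
        log_probs = []
        for column in range(len(counts[0])):
            # Binarize: any count above zero means "word is present".
            n_present = sum(1 for row in rows if row[column] > 0)
            log_probs.append(math.log((n_present + alpha) / (len(rows) + 2 * alpha)))
        result[label] = log_probs
    return result

demo_log_probs = bernoulli_log_probs(demo_counts, demo_labels)
for label in sorted(demo_log_probs):
    print(label, ['%.2f' % math.exp(lp) for lp in demo_log_probs[label]])
```

Because only presence matters, a word that occurs three times in a slice counts the same as a word that occurs once, which is close in spirit to the `contains(word)` features that NLTK/TextBlob reports.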

In [26]:
import time, random
from gensim import corpora, matutils
#from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB

# -----------------------------------------------------------------------------
# THE BY NOW FAMILIAR TRANSFORMATION:
#     1.  CORPUS AS A LIST OF LISTS OF STRINGS ("training_samples_sklearn",
#         "testing_samples_sklearn") TO
#     2.  GENSIM DICTIONARY AND DOC2BOW CORPUS
#     3.  GENSIM SPARSE MATRIX (I.E., A DOC2BOW CORPUS) TO DENSE MATRIX
# -----------------------------------------------------------------------------

dictionary = corpora.Dictionary(training_samples_sklearn + testing_samples_sklearn)

print 'len(dictionary)', len(dictionary)

training_corpus = [dictionary.doc2bow(text) for text in training_samples_sklearn]

testing_corpus = [dictionary.doc2bow(text) for text in testing_samples_sklearn]

training_matrix = matutils.corpus2dense(training_corpus, len(dictionary))
training_matrix = training_matrix.T

print 'training_matrix', training_matrix.shape

testing_matrix = matutils.corpus2dense(testing_corpus, len(dictionary))
testing_matrix = testing_matrix.T

print 'testing_matrix', testing_matrix.shape

# -----------------------------------------------------------------------------
# TWO LINES TO TRAIN THE CLASSIFIER
# -----------------------------------------------------------------------------

start_time = time.time()

classifier = BernoulliNB()
classifier.fit(training_matrix, training_labels)

stop_time = time.time()

print 'classifier trained', (stop_time - start_time)

# -----------------------------------------------------------------------------
# ONE LINE TO ASSESS THE ACCURACY OF THE CLASSIFIER . . . 
# -----------------------------------------------------------------------------

start_time = time.time()

score = classifier.score(testing_matrix, testing_labels)

print 'accuracy', score
        
stop_time = time.time()

print 'testing done', (stop_time - start_time)

# -----------------------------------------------------------------------------
# . . . AND A CALL TO THE FUNCTION WHICH CONVERTS THE PROBABILITIES, ETC.
# -----------------------------------------------------------------------------

print
print_most_informative_sklearn(classifier, 20)



len(dictionary) 22596
training_matrix (157, 22596)
testing_matrix (157, 22596)
classifier trained 0.0401010513306
accuracy 0.955414012739
testing done 0.0406680107117

prisoner             boy : girl 	 24.1 : 1.0
captured             boy : girl 	 20.3 : 1.0
mamma                girl : boy 	 17.4 : 1.0
vessels              boy : girl 	 15.2 : 1.0
united               boy : girl 	 15.2 : 1.0
halt                 boy : girl 	 15.2 : 1.0
elsie                girl : boy 	 15.0 : 1.0
darling              girl : boy 	 15.0 : 1.0
guns                 boy : girl 	 14.6 : 1.0
tremendous           boy : girl 	 13.9 : 1.0
blankets             boy : girl 	 13.9 : 1.0
pray                 girl : boy 	 13.4 : 1.0
warriors             boy : girl 	 12.7 : 1.0
u                    boy : girl 	 12.7 : 1.0
range                boy : girl 	 12.7 : 1.0
ladder               boy : girl 	 12.7 : 1.0
hunt                 boy : girl 	 12.7 : 1.0
f                    boy : girl 	 12.7 : 1.0
dense                b

## Is Alger more like "boy" books, or "girl" books?  Best sellers?


### Routines to 1) get data and 2) train and evaluate a classifier

I want these in a separate cell, because I'm going to be doing them a bunch . . .

In [27]:
import time, random
from gensim import corpora, matutils
from sklearn.naive_bayes import BernoulliNB

def get_some_data(folders, slice_size):

    sources = []
    samples = []
    labels = []

    for folder in folders:
        for text in my_corpora[folder]:

            random_slice = text.get_random_slice(slice_size)

            sources.append(text.author_title)
            samples.append(random_slice)
            labels.append(folder)
            
    return sources, samples, labels

def train_and_test_a_classifier(training_samples, 
                                training_labels, 
                                testing_samples, 
                                testing_labels,
                                prediction_samples):
        
    dictionary = corpora.Dictionary(training_samples + testing_samples + prediction_samples)

    training_corpus = [dictionary.doc2bow(text) for text in training_samples]

    testing_corpus = [dictionary.doc2bow(text) for text in testing_samples]

    training_matrix = matutils.corpus2dense(training_corpus, len(dictionary))
    training_matrix = training_matrix.T

    testing_matrix = matutils.corpus2dense(testing_corpus, len(dictionary))
    testing_matrix = testing_matrix.T
    
    classifier = BernoulliNB()
    classifier.fit(training_matrix, training_labels)

    score = classifier.score(testing_matrix, testing_labels)
    
    return dictionary, classifier, score


### Alger process

The question:  **Is Alger text more like texts favored by boys, or by girls?**

1.  Train and test a classifier using "boys" and "girls" books
2.  Ask the classifier to predict the labels for Alger.  Does it think the Alger data is from "boy" or "girl" books?
3.  Repeat.
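Step 2 boils down to tallying the predicted labels.  A minimal sketch of that tally, using `collections.Counter` instead of hand-written counters; `share_of_label` and `demo_predictions` are hypothetical names, and the predictions are made up rather than taken from a real `classifier.predict(...)` call.  Written for Python 3.

```python
from collections import Counter

def share_of_label(predicted_labels, label):
    """Fraction of predictions equal to `label` (0.0 for an empty list)."""
    tallies = Counter(predicted_labels)
    total = sum(tallies.values())
    return tallies[label] / float(total) if total else 0.0

# Hypothetical predictions standing in for classifier.predict(...) on Alger slices.
demo_predictions = ['girls', 'girls', 'boys', 'girls']
print(Counter(demo_predictions))
print('%.2f' % share_of_label(demo_predictions, 'girls'))
```

Repeating the whole train/test/predict cycle (step 3) matters because every run draws fresh random slices, so a single run's share is only one draw from a range of possible results.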

### Unanswered questions

I have more girl samples than boy samples.  Does this skew my results?  Or does the accuracy number indicate that I really don't need to worry so much about that?
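One way to start on this question is to compare the accuracy against a majority-class baseline, and to look at accuracy within each class separately.  The sketch below uses the sample counts from the folder listing (69 boys, 88 girls); `per_class_accuracy` and the `demo_*` labels are hypothetical and made up for illustration.  Written for Python 3.

```python
# Sample counts taken from the folder listing near the top of the notebook.
n_boys, n_girls = 69, 88
n_total = n_boys + n_girls

# Accuracy of a do-nothing classifier that always answers with the larger class.
majority_baseline = max(n_boys, n_girls) / float(n_total)
print('majority baseline %.3f' % majority_baseline)

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction correct within each true class, to spot a neglected class."""
    correct, seen = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        seen[truth] = seen.get(truth, 0) + 1
        correct[truth] = correct.get(truth, 0) + (1 if truth == guess else 0)
    return {label: correct[label] / float(seen[label]) for label in seen}

# Hypothetical true and predicted labels, for illustration only.
demo_truth = ['boys', 'boys', 'girls', 'girls', 'girls']
demo_guess = ['boys', 'girls', 'girls', 'girls', 'girls']
demo_per_class = per_class_accuracy(demo_truth, demo_guess)
print(demo_per_class)
```

An overall accuracy well above the baseline suggests the classifier is doing more than guessing "girls", but it does not rule out a tilt toward the larger class; the per-class numbers would show whether the errors fall mostly on one side.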

In [37]:
for a in range(25):
    
    training_sources, training_samples, training_labels = get_some_data(['boys', 'girls'], 2000)
    testing_sources, testing_samples, testing_labels = get_some_data(['boys', 'girls'], 2000)
    
    alger_sources, alger_samples, alger_labels = get_some_data(['alger',], 2000)

    dictionary, classifier, score = train_and_test_a_classifier(training_samples, 
                                                                training_labels, 
                                                                testing_samples, 
                                                                testing_labels, 
                                                                alger_samples)
    
    alger_corpus = [dictionary.doc2bow(text) for text in alger_samples]

    alger_matrix = matutils.corpus2dense(alger_corpus, len(dictionary))
    alger_matrix = alger_matrix.T
    
    results = classifier.predict(alger_matrix)
    
    n_girls = 0
    n_boys = 0
    for r in results:
        if r == 'girls':
            n_girls += 1
        if r == 'boys':
            n_boys += 1
            
    print 'accuracy', '%.2f' % score, 'len(alger_samples)', len(alger_samples), \
            'n_girls', n_girls, 'n_boys', n_boys, \
            ' --> ', \
            'girls/n samples', '%.2f' % (float(n_girls) / float(len(alger_samples)))

accuracy 0.94 len(alger_samples) 48 n_girls 42 n_boys 6  -->  girls/n samples 0.88
accuracy 0.97 len(alger_samples) 48 n_girls 41 n_boys 7  -->  girls/n samples 0.85
accuracy 0.97 len(alger_samples) 48 n_girls 40 n_boys 8  -->  girls/n samples 0.83
accuracy 0.97 len(alger_samples) 48 n_girls 38 n_boys 10  -->  girls/n samples 0.79
accuracy 0.96 len(alger_samples) 48 n_girls 37 n_boys 11  -->  girls/n samples 0.77
accuracy 0.94 len(alger_samples) 48 n_girls 44 n_boys 4  -->  girls/n samples 0.92
accuracy 0.96 len(alger_samples) 48 n_girls 37 n_boys 11  -->  girls/n samples 0.77
accuracy 0.96 len(alger_samples) 48 n_girls 42 n_boys 6  -->  girls/n samples 0.88
accuracy 0.97 len(alger_samples) 48 n_girls 39 n_boys 9  -->  girls/n samples 0.81
accuracy 0.97 len(alger_samples) 48 n_girls 39 n_boys 9  -->  girls/n samples 0.81
accuracy 0.97 len(alger_samples) 48 n_girls 42 n_boys 6  -->  girls/n samples 0.88
accuracy 0.98 len(alger_samples) 48 n_girls 39 n_boys 9  -->  girls/n samples 0.81
a

### Are these accuracy numbers computed the way I think they are?

Given that I have 157 samples (69 boys + 88 girls), can these accuracy numbers be reproduced as number correct / number of samples?  Or is there something else going on with "accuracy"?

NLTK (several cells above) returned 

    accuracy 0.929936305732 which makes sense (11 wrong out of 157)

sklearn (immediately preceding cell) returned:

    accuracy 0.949044585987 (8 wrong)
    accuracy 0.987261146497 (2)
    accuracy 0.955414012739 (7)
    accuracy 0.96178343949 (6)
    accuracy 0.974522292994 (4)
    accuracy 0.96178343949 (6)
    
The "accuracy" numbers look explicable.

In [30]:
for a in range(1, 12):
    print a, 'wrong = ', float(157 - a) / 157.0

1 wrong =  0.993630573248
2 wrong =  0.987261146497
3 wrong =  0.980891719745
4 wrong =  0.974522292994
5 wrong =  0.968152866242
6 wrong =  0.96178343949
7 wrong =  0.955414012739
8 wrong =  0.949044585987
9 wrong =  0.942675159236
10 wrong =  0.936305732484
11 wrong =  0.929936305732


### Same thing, but for best sellers instead of Alger

A slightly different question: **Are best sellers more like texts favored by boys, or by girls?**

In [38]:
for a in range(25):
    
    training_sources, training_samples, training_labels = get_some_data(['boys', 'girls'], 2000)
    testing_sources, testing_samples, testing_labels = get_some_data(['boys', 'girls'], 2000)
    
    best_seller_sources, best_seller_samples, best_seller_labels = get_some_data(['best_sellers',], 2000)

    dictionary, classifier, score = train_and_test_a_classifier(training_samples, 
                                                                training_labels, 
                                                                testing_samples, 
                                                                testing_labels, 
                                                                best_seller_samples)
    
    best_seller_corpus = [dictionary.doc2bow(text) for text in best_seller_samples]

    best_seller_matrix = matutils.corpus2dense(best_seller_corpus, len(dictionary))
    best_seller_matrix = best_seller_matrix.T
    
    results = classifier.predict(best_seller_matrix)
    
    n_girls = 0
    n_boys = 0
    for r in results:
        if r == 'girls':
            n_girls += 1
        if r == 'boys':
            n_boys += 1
            
    print 'accuracy', '%.2f' % score, 'len(best_seller_matrix)', len(best_seller_matrix), \
            'n_girls', n_girls, 'n_boys', n_boys, \
            ' --> ', \
            'girls/n samples', '%.2f' % (float(n_girls) / float(len(best_seller_matrix)))

accuracy 0.96 len(best_seller_matrix) 105 n_girls 95 n_boys 10  -->  girls/n samples 0.90
accuracy 0.94 len(best_seller_matrix) 105 n_girls 98 n_boys 7  -->  girls/n samples 0.93
accuracy 0.97 len(best_seller_matrix) 105 n_girls 97 n_boys 8  -->  girls/n samples 0.92
accuracy 0.96 len(best_seller_matrix) 105 n_girls 94 n_boys 11  -->  girls/n samples 0.90
accuracy 0.97 len(best_seller_matrix) 105 n_girls 97 n_boys 8  -->  girls/n samples 0.92
accuracy 0.96 len(best_seller_matrix) 105 n_girls 97 n_boys 8  -->  girls/n samples 0.92
accuracy 0.97 len(best_seller_matrix) 105 n_girls 95 n_boys 10  -->  girls/n samples 0.90
accuracy 0.95 len(best_seller_matrix) 105 n_girls 96 n_boys 9  -->  girls/n samples 0.91
accuracy 0.97 len(best_seller_matrix) 105 n_girls 95 n_boys 10  -->  girls/n samples 0.90
accuracy 0.92 len(best_seller_matrix) 105 n_girls 97 n_boys 8  -->  girls/n samples 0.92
accuracy 0.97 len(best_seller_matrix) 105 n_girls 98 n_boys 7  -->  girls/n samples 0.93
accuracy 0.96 len

## What happens if I leave Alger in with "boys"?

One theory, which we never proved to our satisfaction, was that **Alger "trains" adolescent readers to become readers of adult bestsellers**.

In [39]:
for a in range(25):
    
    training_sources, training_samples, training_labels = get_some_data(['alger_boys', 'girls'], 2000)
    testing_sources, testing_samples, testing_labels = get_some_data(['alger_boys', 'girls'], 2000)
    
    best_seller_sources, best_seller_samples, best_seller_labels = get_some_data(['best_sellers',], 2000)

    dictionary, classifier, score = train_and_test_a_classifier(training_samples, 
                                                                training_labels, 
                                                                testing_samples, 
                                                                testing_labels, 
                                                                best_seller_samples)
    
    best_seller_corpus = [dictionary.doc2bow(text) for text in best_seller_samples]

    best_seller_matrix = matutils.corpus2dense(best_seller_corpus, len(dictionary))
    best_seller_matrix = best_seller_matrix.T
    
    results = classifier.predict(best_seller_matrix)
    
    n_girls = 0
    n_boys = 0
    for r in results:
        if r == 'girls':
            n_girls += 1
        if r == 'alger_boys':
            n_boys += 1
            
    print 'accuracy', '%.2f' % score, 'len(best_seller_matrix)', len(best_seller_matrix), \
            'n_girls', n_girls, 'n_boys', n_boys, \
            ' --> ', \
            'girls/n samples', '%.2f' % (float(n_girls) / float(len(best_seller_matrix)))

accuracy 0.97 len(best_seller_matrix) 105 n_girls 69 n_boys 36  -->  girls/n samples 0.66
accuracy 0.94 len(best_seller_matrix) 105 n_girls 69 n_boys 36  -->  girls/n samples 0.66
accuracy 0.96 len(best_seller_matrix) 105 n_girls 65 n_boys 40  -->  girls/n samples 0.62
accuracy 0.94 len(best_seller_matrix) 105 n_girls 68 n_boys 37  -->  girls/n samples 0.65
accuracy 0.94 len(best_seller_matrix) 105 n_girls 70 n_boys 35  -->  girls/n samples 0.67
accuracy 0.96 len(best_seller_matrix) 105 n_girls 70 n_boys 35  -->  girls/n samples 0.67
accuracy 0.95 len(best_seller_matrix) 105 n_girls 57 n_boys 48  -->  girls/n samples 0.54
accuracy 0.96 len(best_seller_matrix) 105 n_girls 68 n_boys 37  -->  girls/n samples 0.65
accuracy 0.96 len(best_seller_matrix) 105 n_girls 64 n_boys 41  -->  girls/n samples 0.61
accuracy 0.94 len(best_seller_matrix) 105 n_girls 66 n_boys 39  -->  girls/n samples 0.63
accuracy 0.96 len(best_seller_matrix) 105 n_girls 66 n_boys 39  -->  girls/n samples 0.63
accuracy 0