# OSP Syllabus Classification

Is a document a document or a syllabus? Using tagged training data, we are about to find out...

In [21]:
from osp.corpus.syllabus import Syllabus
import pandas as pd
import numpy as np
import scipy

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
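# Legacy module paths: sklearn.cross_validation and sklearn.grid_search were
# removed in scikit-learn 0.20; newer releases provide these names in
# sklearn.model_selection (with a different ShuffleSplit signature).
from sklearn.cross_validation import ShuffleSplit, cross_val_score
from sklearn.grid_search import GridSearchCV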

In [2]:
raw_training_df = pd.read_csv('/home/ubuntu/osp-tagging.csv')
print(raw_training_df.columns, raw_training_df.size)

Index(['﻿"id"', 'title', 'text', 'url', 'tags'], dtype='object') 84340


Many documents in the training data were never tagged. pandas reads their empty tags field as NaN, which is a float, so a row counts as labeled only if its tags value is a string. Remove the unlabeled rows, along with any tagged rows that have a null in the text field.

In [11]:
training_df = raw_training_df[raw_training_df.tags.apply(
        lambda x: isinstance(x, str)) & ~raw_training_df.text.isnull()]

print("Raw row count: {}, labeled row count: {}".format(raw_training_df.shape, training_df.shape))

Raw row count: (16868, 5), labeled row count: (772, 5)
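As a small illustration of why the filter above checks for strings, the sketch below loads a made-up three-row CSV (the rows and column values are hypothetical, not taken from osp-tagging.csv) and shows that an empty tags cell fails the `isinstance(x, str)` test. It also shows that `notnull()` selects the same rows here, which may be a simpler way to express the same filter.

```python
import io
import pandas as pd

# Hypothetical miniature of the tagging export: row 2 has no tag, row 3 has no text.
csv_text = (
    "id,title,text,tags\n"
    "1,A,some syllabus text,Syllabus\n"
    "2,B,more text,\n"
    "3,C,,Not Syllabus\n"
)
demo_df = pd.read_csv(io.StringIO(csv_text))

# Same filter as in the notebook: tag must be a string, text must not be null.
by_type = demo_df[demo_df.tags.apply(lambda x: isinstance(x, str))
                  & ~demo_df.text.isnull()]

# Equivalent filter expressed with notnull() on both columns.
by_notnull = demo_df[demo_df.tags.notnull() & demo_df.text.notnull()]

print(by_type.id.tolist())     # [1]
print(by_notnull.id.tolist())  # [1]
```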


The training data was tagged in the Overview project interface. Several different tags were available:

- Course Description
- "Course Description,Not Syllabus"
- Lab Syllabus
- Lesson or Lecture
- "No Citations,Syllabus"
- Not Syllabus
- Reading List
- "Reading List,Syllabus"
- Syllabus
- "Syllabus,Lab Syllabus"
- "Syllabus,No Citations"
- "Syllabus,No Citations,Odd for some reason"
- "Syllabus,Odd for some reason"
- "Syllabus,Odd for some reason,No Citations"

This classifier only separates syllabi from non-syllabi. Any document tagged as a syllabus therefore counts as a positive example, whether or not it has citations and whether or not it was flagged as odd.

In [12]:
def is_syllabus_tag(tag):
    try:
        return 'syllabus' in tag.lower() and 'not syllabus' not in tag.lower()
    except AttributeError:
        return False

is_syllabus = training_df.tags.apply(is_syllabus_tag)
positive_examples_df = training_df[is_syllabus]
negative_examples_df = training_df[~is_syllabus]

print('Positive examples: {}, negative examples: {}'.format(positive_examples_df.shape, negative_examples_df.shape))

Positive examples: (301, 5), negative examples: (471, 5)
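To make the labeling rule concrete, the sketch below applies the same substring test to a handful of the tag values listed earlier, plus a NaN to stand in for an untagged row. The function is restated here only so the snippet runs on its own; the `examples` list is an illustrative selection, not the full tag set.

```python
def is_syllabus_tag(tag):
    """Return True when the tag string marks a syllabus of any kind."""
    try:
        lowered = tag.lower()
    except AttributeError:
        # Non-string values (such as NaN for untagged rows) are never syllabi.
        return False
    return 'syllabus' in lowered and 'not syllabus' not in lowered

examples = [
    'Syllabus',
    'Lab Syllabus',
    'Syllabus,No Citations',
    'Reading List,Syllabus',
    'Not Syllabus',
    'Course Description,Not Syllabus',
    'Reading List',
    'Lesson or Lecture',
    float('nan'),
]

for tag in examples:
    print('{!r:40} -> {}'.format(tag, is_syllabus_tag(tag)))
```

The first four examples come out as syllabi and the remaining five do not. Note that "Reading List,Syllabus" is positive while "Reading List" alone is negative, because the rule looks only for the word "syllabus".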


We tokenize the syllabus text in the positive and negative examples, and featurize them for a classifier.

First pass: tf-idf features of text tokens, classified using naive Bayes.

In [18]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [20]:
ss = ShuffleSplit(len(training_df), test_size=0.25, random_state=983214)
cv_results = cross_val_score(text_clf, training_df.text.values, is_syllabus.values, cv=ss)
cv_results.mean()

0.87150259067357505

In [26]:
ss = ShuffleSplit(len(training_df), test_size=0.25, random_state=983214)
cv_results = cross_val_score(text_clf, training_df.text.values, is_syllabus.values, cv=ss, scoring='roc_auc')
cv_results.mean()

0.94591548490478239
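The two cells above use the legacy `sklearn.cross_validation` API, where `ShuffleSplit` takes the number of samples as its first argument. For readers on a newer scikit-learn, the following is a rough sketch of the same evaluation with `sklearn.model_selection`, where `ShuffleSplit` takes `n_splits` instead (10 matches the old default number of iterations). The corpus here is a made-up stand-in so the snippet runs without the tagging CSV; its scores say nothing about the real data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Hypothetical toy corpus: 20 syllabus-like and 20 unrelated documents.
toy_texts = np.array(
    ["course syllabus week {} readings grading office hours".format(i)
     for i in range(10, 30)]
    + ["department news event {} campus announcement faculty".format(i)
       for i in range(10, 30)]
)
toy_labels = np.array([1] * 20 + [0] * 20)

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Newer API: the splitter no longer needs the sample count up front.
ss = ShuffleSplit(n_splits=10, test_size=0.25, random_state=983214)

accuracy = cross_val_score(text_clf, toy_texts, toy_labels, cv=ss)
roc_auc = cross_val_score(text_clf, toy_texts, toy_labels, cv=ss,
                          scoring='roc_auc')

print("mean accuracy:", accuracy.mean())
print("mean ROC AUC:", roc_auc.mean())
```

With the real data, `toy_texts` and `toy_labels` would be replaced by `training_df.text.values` and `is_syllabus.values`.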

We get 87.15% mean accuracy and a mean ROC AUC of 0.946 using out-of-the-box features and the multinomial NB classifier. Let's try tuning the feature set a little.
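For context on the accuracy figure, it helps to compare it with a majority-class baseline. The arithmetic below uses only the example counts printed earlier (301 positive, 471 negative); the baseline itself is an added point of comparison, not something computed in the original notebook.

```python
positives = 301   # syllabus examples, from the counts printed above
negatives = 471   # non-syllabus examples
total = positives + negatives

# Accuracy of a classifier that always predicts the larger class.
majority_baseline = max(positives, negatives) / total
mean_accuracy = 0.8715  # cross-validated accuracy reported above, rounded

print("baseline accuracy: {:.3f}".format(majority_baseline))               # 0.610
print("gain over baseline: {:.3f}".format(mean_accuracy - majority_baseline))  # 0.261
```

So the naive Bayes pipeline improves on always guessing "not a syllabus" by roughly 26 percentage points.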

In [24]:
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams only, or unigrams plus bigrams
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2')
}

grid_search = GridSearchCV(text_clf, parameters, n_jobs=-1, verbose=1)
grid_search.fit(training_df.text.values, is_syllabus.values)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Done   1 jobs       | elapsed:   10.3s
[Parallel(n_jobs=-1)]: Done  12 out of  18 | elapsed:  1.3min remaining:   38.8s
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  1.8min finished


Best score: 0.845
Best parameters set:
	vect__max_df: 0.5
	vect__ngram_range: (1, 2)


In this grid search across parameters, having a cutoff that eliminates words with document frequency above 0