# Machine learning
From a [session taught at NICAR 2015](https://github.com/cjdd3b/nicar2015/tree/master/machine-learning):

>For this exercise, we'll be training a simple classifier that learns how to categorize bills from the California Legislature based only on their titles. Along the way, we'll focus on three steps critical to any supervised learning application: feature engineering, model building and evaluation.

Original data: https://github.com/cjdd3b/nicar2015/blob/master/machine-learning/data/training.txt

In [1]:
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# STEP 1: DATA IMPORT AND PREPROCESSING

Here we read the training data and split it into two lists: one holding the text of
each bill title, and a second holding each title's corresponding category. Order matters:
the first bill in list 1 must correspond to the first category in list 2.

In [2]:
# each line of the file looks like "bill title|category"
with open('data/training.txt', 'r') as f:
    training = [line.strip().split('|') for line in f]

In [3]:
training[:10]

[['An act to amend Section 44277 of the Education Code, relating to teachers.',
  'Education'],
 ['An act to add Section 8314.4 to the Government Code, relating to public funds.',
  'Public Services'],
 ['An act to amend Sections 226, 233, and 234 of, and to add Article 1.5 (commencing with Section 245) to Chapter 1 of Part 1 of Division 2 of, the Labor Code, relating to employment.',
  'Labor and Employment'],
 ['An act to amend Sections 12920, 12921, 12926, 12940, and 12955.2 of the Government Code, relating to employment.',
  'Labor and Employment'],
 ['An act to amend Section 186.8 of, and to add Section 236.4 to, the Penal Code, relating to human trafficking.',
  'Crime'],
 ['An act to amend Section 13823.17 of the Penal Code, relating to domestic violence.',
  'Social Issues'],
 ['An act to add Sections 5017.1, 5017.5, and 5103.5 to the Business and Professions Code, relating to accountants.',
  'Business and Consumers'],
 ['An act to add Section 15817.5 to the Government Code, r

In [4]:
text = [t[0] for t in training if len(t) > 1]
labels = [t[1] for t in training if len(t) > 1]
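
The `len(t) > 1` filter matters because a blank or malformed line splits into a single-element list with no category. A minimal sketch of that behavior, using made-up in-memory lines in place of `data/training.txt` (the names `sample_lines`, `sample_text` and `sample_labels` are hypothetical):

```python
# Hypothetical sample lines in the same "title|category" layout as training.txt.
sample_lines = [
    "An act to amend Section 44277 of the Education Code, relating to teachers.|Education\n",
    "\n",  # a blank line splits into [''] and carries no category
    "An act to amend Section 13823.17 of the Penal Code, relating to domestic violence.|Social Issues\n",
]

rows = [line.strip().split('|') for line in sample_lines]

# Keep only rows that have both a title and a category, so the lists stay aligned.
pairs = [(row[0], row[1]) for row in rows if len(row) > 1]
sample_text = [title for title, _ in pairs]
sample_labels = [category for _, category in pairs]

for title, category in zip(sample_text, sample_labels):
    print('%s <- %s' % (category, title))
```

Because both lists are built from the same filtered rows, position `i` in one always matches position `i` in the other.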

A little bit of cleanup for scikit-learn's benefit. Scikit-learn models want our categories to be numbers, not strings. The LabelEncoder performs this transformation.

In [5]:
encoder = preprocessing.LabelEncoder()
correct_labels = encoder.fit_transform(labels)

In [7]:
print(len(correct_labels))
print(correct_labels)

5750
[10 31 25 ..., 19 19 27]
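
To see what the encoder is doing, here is a small sketch on a handful of made-up labels (the `toy_` names are hypothetical). `LabelEncoder` sorts the distinct labels, stores them in `classes_`, and replaces each label with its index in that sorted list; `inverse_transform` maps the numbers back to strings.

```python
from sklearn import preprocessing

# Hypothetical labels, not taken from the training file.
toy_labels = ['Education', 'Crime', 'Education', 'Labor and Employment']

toy_encoder = preprocessing.LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

print(toy_encoder.classes_)                      # distinct labels, sorted
print(toy_codes)                                 # index of each label in classes_
print(toy_encoder.inverse_transform(toy_codes))  # back to the original strings
```

This is why the prediction step later looks names up in `encoder.classes_`: the model only ever sees the integer codes.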


# STEP 2: FEATURE EXTRACTION

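These two lines use scikit-learn helpers to transform our training data into a document/term matrix: one row per bill title, one column per distinct word, and each cell holding the number of times that word appears in that title.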

In [8]:
vectorizer = CountVectorizer()
# convert the sparse result to a plain 2D NumPy array
data = vectorizer.fit_transform(text).toarray()

In [9]:
print(data)
print(data.shape)

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]
(5750, 7545)
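
The full matrix is too large to read, so here is a sketch of the same transformation on two made-up titles (the `toy_` names are hypothetical). The fitted `vocabulary_` attribute maps each word to its column index, which lets us label the columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical titles, shortened for readability.
toy_titles = [
    "An act relating to teachers.",
    "An act relating to public funds.",
]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_titles)  # scipy sparse matrix

# vocabulary_ maps word -> column index; sort words by index to get column order.
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(columns)
print(toy_matrix.toarray())
print(toy_matrix.shape)  # (number of titles, number of distinct words)
```

In the real data the shape `(5750, 7545)` reads the same way: 5,750 bill titles and 7,545 distinct words.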


# STEP 3: MODEL BUILDING

In [10]:
# multinomial naive bayes

model = MultinomialNB()
fit_model = model.fit(data, correct_labels)

In [11]:
fit_model

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
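
As a rough illustration of what the fitted model learns, the sketch below trains `MultinomialNB` on four made-up titles and classifies a fifth (all `toy_` names and titles are hypothetical). It also passes the sparse matrix straight to `fit`, which `MultinomialNB` accepts, so converting to a dense array is optional for this model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical miniature training set.
toy_titles = [
    "relating to teachers",
    "relating to schools and teachers",
    "relating to human trafficking",
    "relating to crime and trafficking",
]
toy_labels = ['Education', 'Education', 'Crime', 'Crime']

toy_vectorizer = CountVectorizer()
toy_data = toy_vectorizer.fit_transform(toy_titles)  # left sparse on purpose

# alpha=1.0 (the default) adds one to every word count, so a word never seen
# in a category does not force that category's probability to zero.
toy_model = MultinomialNB(alpha=1.0)
toy_model.fit(toy_data, toy_labels)

toy_new = toy_vectorizer.transform(["An act relating to teachers."])
toy_prediction = toy_model.predict(toy_new)
print(toy_prediction)
```

The word "teachers" appears only in the Education titles, so it tips the prediction toward that category.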

# STEP 4: EVALUATION 

In [12]:
# k-fold cross-validation, with 10 folds

scores = cross_validation.cross_val_score(model, data, correct_labels, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.55 (+/- 0.20)
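
The printed figure is the mean accuracy across the 10 folds, and the "+/-" part is two standard deviations of the per-fold scores. Here is a sketch of the same calculation on made-up data with 3 folds. It assumes a recent scikit-learn, where `cross_val_score` is imported from `sklearn.model_selection` rather than the `cross_validation` module used in this notebook; the titles and `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 6 short titles for each of 2 categories.
toy_titles = [
    "relating to teachers", "relating to schools", "relating to school districts",
    "relating to teachers and pupils", "relating to community colleges", "relating to pupils",
    "relating to human trafficking", "relating to crime", "relating to domestic violence",
    "relating to firearms", "relating to crime and firearms", "relating to trafficking",
]
toy_labels = ['Education'] * 6 + ['Crime'] * 6

toy_data = CountVectorizer().fit_transform(toy_titles)

# cv=3: split the data into 3 folds; each fold is held out once for scoring
# while the model is trained on the remaining two.
toy_scores = cross_val_score(MultinomialNB(), toy_data, toy_labels, cv=3)

print(toy_scores)  # one accuracy value per fold
print("Accuracy: %0.2f (+/- %0.2f)" % (toy_scores.mean(), toy_scores.std() * 2))
```

A wide "+/-" range, like the 0.20 above, means accuracy varies a lot depending on which bills land in the held-out fold.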




# STEP 5: APPLYING THE MODEL

In [13]:
docs_new = ["Public postsecondary education: executive officer compensation.",
                "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
                "Political Reform Act of 1974: campaign disclosures.",
                "An act to add Section 236.3 to the Penal Code, relating to human trafficking."
            ]

In [14]:
test_data = vectorizer.transform(docs_new)

In [15]:
for i in range(len(docs_new)):
    # predict() expects a 2D array, so pass row i as a 1 x n_features slice
    print('%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data[i].toarray())]))

Public postsecondary education: executive officer compensation. -> ['Education']
An act to add Section 236.3 to the Education code, related to the pricing of college textbooks. -> ['Education']
Political Reform Act of 1974: campaign disclosures. -> ['Campaign Finance and Election Issues']
An act to add Section 236.3 to the Penal Code, relating to human trafficking. -> ['Crime']
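
Note that new documents go through `vectorizer.transform`, not `fit_transform`. The sketch below, on made-up titles with hypothetical `toy_` names, shows why: `transform` reuses the vocabulary learned from the training data, so the new matrix has the same columns the model was trained on, and words the vectorizer has never seen are simply dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training titles.
train_titles = [
    "An act relating to teachers.",
    "An act relating to human trafficking.",
]
toy_vectorizer = CountVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_titles)

# A new title containing words ("college", "textbooks") absent from training.
new_titles = ["An act relating to college textbooks."]
new_matrix = toy_vectorizer.transform(new_titles)

print(train_matrix.shape)  # (2, number of training words)
print(new_matrix.shape)    # (1, same number of columns)
print('college' in toy_vectorizer.vocabulary_)
```

Calling `fit_transform` on the new titles instead would build a different vocabulary, and the columns would no longer line up with what the model expects.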


# Improvements

In [16]:
# change how features are interpreted

# stop_words='english': ignore "stop words" that are too common to be informative
# min_df=2: ignore words that appear in fewer than 2 bill titles
# lowercase=True: make all words lowercase

vectorizer = CountVectorizer(stop_words='english', min_df=2, lowercase=True, analyzer='word')
data = vectorizer.fit_transform(text).toarray()
model = MultinomialNB()
fit_model = model.fit(data, correct_labels)
scores = cross_validation.cross_val_score(model, data, correct_labels, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.61 (+/- 0.13)
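
To make the effect of those options concrete, this sketch compares the vocabulary with and without them on three made-up titles (the titles and variable names are hypothetical). `lowercase=True` and `analyzer='word'` are already `CountVectorizer` defaults, so the change comes from `stop_words` and `min_df`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical titles.
toy_titles = [
    "An act relating to teachers",
    "An act relating to teachers and schools",
    "An act relating to crime",
]

default_vectorizer = CountVectorizer()
default_vectorizer.fit(toy_titles)

filtered_vectorizer = CountVectorizer(stop_words='english', min_df=2)
filtered_vectorizer.fit(toy_titles)

# Every distinct word of 2+ characters is kept by default.
print(sorted(default_vectorizer.vocabulary_))

# Stop words ("an", "and", "to") are removed, and so are words that occur
# in only one title ("crime", "schools").
print(sorted(filtered_vectorizer.vocabulary_))
```

Fewer, more informative columns give the classifier less noise to fit, which is consistent with the accuracy gain reported above.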


In [17]:
# use a different model

# Random Forest classifier

model = RandomForestClassifier(n_estimators=10, random_state=0)
fit_model = model.fit(data, correct_labels)
scores = cross_validation.cross_val_score(model, data, correct_labels, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.67 (+/- 0.09)
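
The two arguments passed to `RandomForestClassifier` are worth a note: `n_estimators=10` builds 10 decision trees that vote on each prediction, and `random_state=0` fixes the randomness so repeated runs produce the same forest. A small sketch of both on made-up titles (the data and names below are hypothetical):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature training set.
toy_titles = [
    "relating to teachers", "relating to schools", "relating to pupils",
    "relating to human trafficking", "relating to crime", "relating to firearms",
]
toy_labels = ['Education'] * 3 + ['Crime'] * 3

toy_data = CountVectorizer().fit_transform(toy_titles)

# Two forests built with the same seed.
forest_a = RandomForestClassifier(n_estimators=10, random_state=0).fit(toy_data, toy_labels)
forest_b = RandomForestClassifier(n_estimators=10, random_state=0).fit(toy_data, toy_labels)

print(len(forest_a.estimators_))  # number of trees in the forest

# Same seed and same data give identical class probabilities.
same = (forest_a.predict_proba(toy_data) == forest_b.predict_proba(toy_data)).all()
print(same)
```

Without a fixed `random_state`, accuracy figures like the one above can shift slightly from run to run.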


In [18]:
docs_new = ["Public postsecondary education: executive officer compensation.",
                "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
                "Political Reform Act of 1974: campaign disclosures.",
                "An act to add Section 236.3 to the Penal Code, relating to human trafficking."
            ]

test_data = vectorizer.transform(docs_new)

for i in range(len(docs_new)):
    # predict() expects a 2D array, so pass row i as a 1 x n_features slice
    print('%s -> %s' % (docs_new[i], encoder.classes_[model.predict(test_data[i].toarray())]))

Public postsecondary education: executive officer compensation. -> ['Education']
An act to add Section 236.3 to the Education code, related to the pricing of college textbooks. -> ['Education']
Political Reform Act of 1974: campaign disclosures. -> ['Campaign Finance and Election Issues']
An act to add Section 236.3 to the Penal Code, relating to human trafficking. -> ['Crime']
