# PyYYC: Intro to Scikit-Learn
by Matt Whiteside, adapted from: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

## Classification Part 1: Training the Machine
Training a machine learning model is a multi-step process:
1. Feature extraction
2. Feature selection
3. Training
4. Evaluation


In [3]:
# First example - training a classifier

# Import our libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

In [4]:
# Let's try out one of the datasets available through sklearn:
# the 20 Newsgroups dataset

# Newsgroups are collections of online messages organized by subject or category
# Can we predict the category from the content of a message?

# Let's pick out a few categories to work with
categories = [
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey']

trainingset = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=36)
# Terminology note: training set - the data a machine learning model learns from

In [5]:
# trainingset is a sklearn "Bunch" object
# The names of the categories we are trying to predict are stored in the "target_names" attribute
trainingset.target_names


['rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey']

In [23]:
# The training data (messages) are under the "data" attribute
print("\nA sample article:\n##------------------------------------\n{}\n##------------------------------------\n\n".format(trainingset.data[0]))

# Can you predict the category based on the message? Will the machine be able to?

# The index of the correct category is available under the "target" attribute
print("The sample message's category index: {}, which is really category: '{}'\n".format(
    trainingset.target[0], trainingset.target_names[trainingset.target[0]]))
# Here is the size of the data we're working with
print("Total messages: {}".format(len(trainingset.data)))
# Count the messages in each category
categories_seen, message_counts = np.unique(trainingset.target, return_counts=True)
for cat, n in zip(categories_seen, message_counts):
    print("  category: {:<20} number of messages: {}".format(trainingset.target_names[cat], n))


A sample article:
##------------------------------------
Organization: Penn State University
From: Robbie Po <RAP115@psuvm.psu.edu>
Subject: Re: Devils and Islanders tiebreaker????
Lines: 14

In article <C5LDI2.77u@odin.corp.sgi.com>, enolan@sharkbite.esd.sgi.com (Ed
Nolan) says:
>If the Islanders beat the Devils tonight, they would finish with
>identical records.  Who's the lucky team that gets to face the Penguins
>in the opening round?   Also, can somebody list the rules for breaking
>ties.
      As I recall, the Penguins and Devils tied for third place last year
with identical records, as well.  Poor Devils -- they always get screwed.
Yet, they should put a scare into Pittsburgh.  They always do!  Pens in 7.
-------------------------------------------------------------------------
** Robbie Po **          PGH PENGUINS!!!    "It won't be easy, but it
Contact for the '93-'94  '91 STANLEY CUP    will have greater rewards.
Penn State Lady Lions    '92 CHAMPIONS      Mountains and Vall

In [24]:
# We need to convert each message into a numeric format that the machine learning algorithm can use.
# This step is called feature extraction.

# The "bag of words" approach

# Convert the messages into a matrix:
# Each column is mapped to a word
# Each row is mapped to a message
# Each matrix value is the number of times that word appeared in that message

# The sklearn CountVectorizer does this for us, e.g.:
example_messages = ["Hello PyYYC, I say again Hello","Welcome to pyyyc"]

vectorizer = CountVectorizer()

# Some method names reappear over and over in sklearn because they are part of its common estimator API.
# "fit", "transform" and "fit_transform" are examples:
# fit: "learn" from the data, here by recording every distinct word and assigning it a column.
#   What was learned is stored in the object, and the fitted object itself is returned.
# transform: using the "fitted" state saved in the object, convert the input data into a new format
# fit_transform: does both in one step
word_counts = vectorizer.fit_transform(example_messages)
word_counts


<2x6 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>
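
The bag-of-words description above can be reproduced by hand. The following is a minimal sketch using only the standard library, not CountVectorizer's actual implementation. The helper names `tokenize` and `bag_of_words` are made up for illustration, and the tokenizer assumes the same rule described in this notebook: lowercase everything and keep only words of two or more characters.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase, then keep runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(message):
    """Split a message into lowercase words of at least two characters."""
    return TOKEN_PATTERN.findall(message.lower())


def bag_of_words(messages):
    """Return (vocabulary, matrix): sorted words and one row of counts per message."""
    token_lists = [tokenize(m) for m in messages]
    # One column per distinct word, in alphabetical order
    vocabulary = sorted({tok for toks in token_lists for tok in toks})
    matrix = []
    for toks in token_lists:
        counts = Counter(toks)  # a missing word counts as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


example_messages = ["Hello PyYYC, I say again Hello", "Welcome to pyyyc"]
vocabulary, matrix = bag_of_words(example_messages)
print(vocabulary)
for row in matrix:
    print(row)
```

The result has 2 rows and 6 columns, with 7 non-zero entries, which matches the shape and the stored-element count reported for the sparse matrix.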

In [8]:
# 2x6 sparse matrix
# There are more than 6 distinct words, so what's going on?

# By default, CountVectorizer only keeps words of 2 or more characters (so "I" is dropped)
# and converts all text to lowercase (so "PyYYC" and "pyyyc" share a column)
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

# What's a sparse matrix?
# Most messages contain 0 instances of most words. Because each word has its own column,
# the matrix will be large AND most of its entries will be zero.
# A sparse matrix only keeps track of non-zero entries to save space
# On the surface, it works just like a normal matrix
print(word_counts[0,1])
print(word_counts[0,5])
print(word_counts[0,:])

2
0
  (0, 0)	1
  (0, 3)	1
  (0, 2)	1
  (0, 1)	2
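
To make the idea of "only keeping track of non-zero entries" concrete, here is a small sketch that stores the example counts in a dictionary keyed by (row, column). This is a simplification for illustration: the Compressed Sparse Row format reported above uses a different, more compact layout, and the helper names `to_sparse` and `lookup` are hypothetical.

```python
# Dense version of the 2x6 example count matrix
dense = [
    [1, 2, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
]


def to_sparse(rows):
    """Keep only the non-zero entries, keyed by (row, column)."""
    return {
        (i, j): value
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if value != 0
    }


def lookup(sparse, i, j):
    """Entries that were not stored are implicit zeros."""
    return sparse.get((i, j), 0)


sparse = to_sparse(dense)
print(len(sparse))           # number of stored elements
print(lookup(sparse, 0, 1))  # a stored, non-zero entry
print(lookup(sparse, 0, 5))  # not stored, so it reads back as 0
```

Only 7 of the 12 cells are stored here. On the full newsgroup matrix, with tens of thousands of columns, the saving is far larger.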


In [9]:
# Which words are the matrix columns mapped to?
# (Newer scikit-learn releases replace get_feature_names() with get_feature_names_out().)
print(vectorizer.get_feature_names())

# You can look up the column index of a word like so:
print(vectorizer.vocabulary_.get('pyyyc'))

['again', 'hello', 'pyyyc', 'say', 'to', 'welcome']
2


In [10]:
# So let's perform feature extraction on our messages
word_counts = vectorizer.fit_transform(trainingset.data)
word_counts.shape

(2389, 30446)

In [11]:
vectorizer.get_feature_names()[10000:10010]

['d1u',
 'd6n',
 'd6s',
 'd90',
 'd96',
 'd_jaracz',
 'da',
 'da_tinker1',
 'daaaaaaaaaaaaaaaaaaaaaay',
 'dab']

In [12]:
# Hmmm, looks like lots of quality informative words there
# We could try to train our machine learning models with this data
# Or maybe we can improve our vectorized training data a bit by stripping out low quality data

# Feature selection

# What if we just focus on words that appear in more than one document?
# We can use the "min_df" option in CountVectorizer for this
vectorizer = CountVectorizer(min_df=2)
word_counts = vectorizer.fit_transform(trainingset.data)
word_counts.shape

(2389, 17830)
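
The effect of `min_df` can be sketched by hand: count how many documents each word appears in (its document frequency) and keep only words that reach the threshold. This is an illustrative sketch on made-up toy messages, not the CountVectorizer implementation. The function names and the toy data are assumptions.

```python
import re


def tokenize(message):
    """Lowercase words of at least two characters (assumed tokenizer)."""
    return re.findall(r"\b\w\w+\b", message.lower())


def document_frequency(messages):
    """Map each word to the number of messages it appears in."""
    df = {}
    for message in messages:
        # set() so a word repeated inside one message is counted once
        for word in set(tokenize(message)):
            df[word] = df.get(word, 0) + 1
    return df


def filter_vocabulary(messages, min_df=2):
    """Keep only words that appear in at least min_df messages."""
    df = document_frequency(messages)
    return sorted(word for word, n in df.items() if n >= min_df)


# Hypothetical toy messages, for illustration only
toy_messages = [
    "The Penguins won the game tonight",
    "The Devils lost the game",
    "My motorcycle needs a new tyre",
]
print(filter_vocabulary(toy_messages, min_df=1))
print(filter_vocabulary(toy_messages, min_df=2))
```

With `min_df=2`, only words shared by at least two messages survive, which is why one-off tokens such as usernames and typos disappear from the vocabulary.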

In [13]:
vectorizer.get_feature_names()[16260:16270]

['tny',
 'to',
 'tobaccos',
 'tobias',
 'toby',
 'tocchet',
 'tochett',
 'tod',
 'today',
 'todays']

In [14]:
# What about the word "to"? Is that helpful?
# Very frequent words like "like", "to", "and" should probably be removed or downweighted

# And another issue: longer messages will have higher word counts. How do we normalize for message size?

# We can use a feature weighting technique called "Term Frequency times Inverse Document Frequency" (TF-IDF).
# Term frequency adjusts for message length by dividing each word's occurrences by the total number of words
# Inverse document frequency downweights words that appear in many documents (the "to"s)

# Let's run our count data through a TF-IDF transformation:
tfidf_transformer = TfidfTransformer().fit(word_counts)
adjusted_word_counts = tfidf_transformer.transform(word_counts)
adjusted_word_counts.shape


(2389, 17830)
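
A worked TF-IDF computation on the small example matrix may help here. This sketch assumes a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by scaling each row to unit length. To my understanding this mirrors the default behaviour of TfidfTransformer, but treat the formula as an assumption and check the library documentation. Scaling the row to unit length is a different way to adjust for message length than dividing by the total word count.

```python
import math


def tfidf(count_matrix):
    """Weight raw counts by smoothed IDF, then scale each row to unit length."""
    n_docs = len(count_matrix)
    n_words = len(count_matrix[0])
    # Document frequency: in how many rows does each column have a non-zero count?
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_words)]
    # Assumed smoothed IDF formula
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]
    weighted = []
    for row in count_matrix:
        scaled = [count * idf_j for count, idf_j in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled))
        weighted.append([v / norm if norm else 0.0 for v in scaled])
    return weighted


# Columns: again, hello, pyyyc, say, to, welcome
counts = [
    [1, 2, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
]
weights = tfidf(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the second row, "pyyyc" occurs in both messages and so receives a lower weight than "to" or "welcome", which occur in only one. That is the downweighting of common words described above.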

In [15]:
word_counts.toarray()[0,8740:8850]

array([2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0],
      dtype=int64)

In [16]:
adjusted_word_counts.toarray()[0,8740:8850]

array([0.18462397, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.02909454, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [17]:
# Finally, the data is ready.
# Let's teach a machine to predict a category from a newsgroup message:

# Default settings, apart from running exactly 10 passes over the training data (max_iter=10, tol=None)
classifier = SGDClassifier(max_iter=10, tol=None)

# Give it our features, and the correct answers
classifier.fit(adjusted_word_counts, trainingset.target)

# And now let's use it to predict categories on messages the classifier has never seen
# Don't forget to apply the same fitted vectorizer and TF-IDF transformer as before (transform, not fit_transform)
test_messages = ["I love to score goals", "I love to hit home runs"]
test_word_counts = vectorizer.transform(test_messages)
test_adj_word_counts = tfidf_transformer.transform(test_word_counts)
predictions = classifier.predict(test_adj_word_counts)

# Is the machine right?
for msg, category in zip(test_messages, predictions):
    print('%r => %s' % (msg, trainingset.target_names[category]))

'I love to score goals' => rec.sport.hockey
'I love to hit home runs' => rec.sport.baseball


In [18]:
# Seems good
# But just how good?

# A test set is part of the data that is withheld from training, 
# so that the classifier's performance can be assessed
# fetch_20newsgroups withholds this data for us
testset = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=36)
len(testset.data)


1590
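
fetch_20newsgroups ships with a predefined train/test split. For datasets that do not, one common approach is a random holdout split. The sketch below uses only the standard library. The function name `holdout_split`, the 40% test fraction and the toy data are all assumptions made for illustration.

```python
import random


def holdout_split(items, labels, test_fraction=0.4, seed=36):
    """Shuffle indices reproducibly and withhold a fraction of them for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([items[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([items[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Hypothetical toy data: 10 messages with alternating labels
messages = ["msg {}".format(i) for i in range(10)]
labels = [i % 2 for i in range(10)]

(train_x, train_y), (test_x, test_y) = holdout_split(messages, labels)
print(len(train_x), len(test_x))
```

The key property is that no message appears in both sets, so the test score reflects performance on data the classifier has never seen.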

In [19]:
test_word_counts = vectorizer.transform(testset.data)
test_adj_word_counts = tfidf_transformer.transform(test_word_counts)
predictions = classifier.predict(test_adj_word_counts)

# Accuracy
print("\nAccuracy: {}\n".format(accuracy_score(testset.target, predictions)))


Accuracy: 0.9610062893081761



In [21]:
# Accuracy is the number of correct assignments (positive or negative) divided by the total number of assignments

# Is accuracy a reliable score?

# Accuracy paradox
# In skewed datasets, you can get high accuracy values by predicting all positive or all negative

# Recall: 
#   messages correctly assigned to that category / total messages in category (True Positives / (True Positives + False Negatives))
# i.e. Out of all the messages in a given category, how many did the classifier correctly predict?

# Precision: 
#   messages correctly assigned to that category / total messages predicted to be in that category (True Positives / (True Positives + False Positives))
# i.e. Out of all the messages predicted by the classifier to be in a category, how many actually belong to it?

# F1 is the harmonic mean of precision and recall: values range between 0 and 1,
# with equal contributions from precision and recall

# Also look at the ROC curve and the Matthews correlation coefficient.

print(classification_report(testset.target, predictions, target_names=testset.target_names))

                    precision    recall  f1-score   support

         rec.autos       0.95      0.95      0.95       396
   rec.motorcycles       0.97      0.95      0.96       398
rec.sport.baseball       0.96      0.95      0.96       397
  rec.sport.hockey       0.96      0.98      0.97       399

       avg / total       0.96      0.96      0.96      1590
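
The definitions above, and the accuracy paradox, can be checked by hand on a deliberately skewed toy example. This is a sketch with made-up labels, not how classification_report is implemented. Returning 0.0 when a denominator is zero is a convention chosen here for simplicity.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one category treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Convention for this sketch: report 0.0 when a denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical skewed data: 95 hockey messages, 5 baseball messages,
# and a "classifier" that always answers hockey
y_true = ["hockey"] * 95 + ["baseball"] * 5
y_pred = ["hockey"] * 100

print(accuracy(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred, positive="baseball"))
print(precision_recall_f1(y_true, y_pred, positive="hockey"))
```

Accuracy comes out at 0.95 even though every baseball message is misclassified, while the baseball recall of 0.0 exposes the problem. This is why the per-category report is more informative than a single accuracy number.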



In [22]:
# How does another machine learning method perform?
from sklearn.svm import LinearSVC

svm_classifier = LinearSVC() # Using all defaults
svm_classifier.fit(adjusted_word_counts, trainingset.target)
svm_predictions = svm_classifier.predict(test_adj_word_counts)
print(classification_report(testset.target, svm_predictions, target_names=testset.target_names))

                    precision    recall  f1-score   support

         rec.autos       0.95      0.98      0.96       396
   rec.motorcycles       0.98      0.95      0.97       398
rec.sport.baseball       0.97      0.95      0.96       397
  rec.sport.hockey       0.97      0.98      0.97       399

       avg / total       0.97      0.97      0.97      1590



In [None]:
# There are many other classifiers to try, and many parameters to tune for each
# Next topic: how to coordinate the search...