#Part 1.  Building a Bayesian Text Classifier
I wanted a way to automatically classify Wikipedia articles into a fixed set of categories.  For this, we'll be using Python 3.4, along with the Natural Language Toolkit (NLTK) and wikipedia modules, both installable through pip (or pip3, as the case may be).


#Data Defs
Here we list the data that will define our "TRAINING" data set, the data we feed the machine to improve its learning and help it make better decisions when presented with test cases later on.


In [1]:
TRAINERS = [{'person':['Bill_Clinton','George_W_Bush','Morgan_Freeman','Susan_Sarandon','Nicole_Kidman','Albert_Einstein',
                       'Stephen_Hawking','Rand_Paul','Charlie_Sheen','Jennifer_Lawrence','Jimi_Hendrix','Simone_Simons',
                       'Jesus_Christ','Immanuel_Kant','Eric_Idle','Robin_Williams','Oprah_Winfrey','Whoopi_Goldberg',
                       'Kendra_Wilkinson','Kathie_Lee_Gifford','Rachael_Ray','Milton_Friedman','Rush_Limbaugh',
                       'Alan_Colmes',
                       'Sean_Hannity','Charles_Krauthammer','Ted_Nugent','Angus_Young','Axl_Rose','Dave_Mustaine',
                       'Mick_Jagger','John_Lennon','Paul_McCartney','Ringo_Starr','George_Harrison','Keith_Richards',
                       'Cindy_Crawford',                     
                      ]},
            {'city':['Anchorage','Auckland','Bali','Beijing','Boise','Boston','Calgary','Chicago','Dallas','Hiroshima',
                     'Kabul','Kansas_City','London','Minneapolis','Mumbai',
                     'New_Brunswick','New_York_City','Okinawa','Paris','Seattle','Soweto','Sydney','Tokyo',
                    'Winnipeg','Spokane','Bozeman','Cincinnati','Charlotte','Atlanta','Orlando','Tampa',
                    'Prague','Warsaw','Minsk','Vilnius','Tbilisi'
                    ]},
            {'movie':['Pretty_Woman','Star_Wars','Timecop','Billy_Bathgate','Die_Hard','12_Monkeys','28_Days',
                      'Bull_Durham','Passion_of_The_Christ','Top_Gun','Brokeback_Mountain','Inception','Toy_Story',
                      'Django_Unchained','The_Empire_Strikes_Back','The_Fifth_Element','Dawn_Of_The_Planet_Of_The_Apes',
                     'Rio_2','How_To_Train_Your_Dragon_2','Big_Eyes','Still_Alice','The_Grand_Budapest_Hotel',
                     'The_Lego_Movie','Jamesy_Boy','The_Nut_Job','The_Legend_Of_Hercules','Kidnapped_For_Christ',
                     ]},
            {'animal':['Orca','Golden_Retriever','Blue_Whale','Cockroach','Mouse','Goldfish','Chinook_Salmon','Dog',
                       'Right_Whale','Snowshoe_Hare','Whitetail_Deer','Ruffed_Grouse','Canadian_Goose',
                       'American_Black_Bear','Grizzly_Bear','Raccoon','Badger','Wolverine','Skunk','Weasel',
                       'African_Elephant',
                      'Asian_Elephant','Polar_Bear','Giraffe','Black_Mamba','Saltwater_Crocodile','Seasnake',
                       'American_Alligator','Komodo_Dragon','Seahorse','Ferret','Snowshoe_Hare','Gerbil','Hamster',
                       'Muskrat',
                      'Crow','Bald_Eagle','Bluebird','Trilobite','Oyster','Jellyfish','Walleye','Great_Blue_Heron','Swan',
                      ]},
            {'study':['Astronomy','Physics','Mathematics','Kinesiology','Sociology','Political_theory','Psychology',
                     'Chemistry','Radiology','Cardiology','Biochemistry','Cosmology','Feminism','Ethics','Biology',
                     'Industrial_Engineering','Cosmetology','Graphic_Design','Interior_Design','Pharmacy','Nutrition',
                      'Art',
                     'Poetry','Literature','Humanities','Computer_Programming','Cooking','Architecture','Marine_Biology',
                     ]},
            {'holiday':['Christmas','Thanksgiving','Valentines_Day','Boxing_Day','Memorial_Day','Veterans_Day',
                       'Columbus_Day','Flag_Day','Labor_Day','Armistice_Day','Victoria_Day','Rememberance_Day','Easter',
                       'Discovery_Day','Orange_walk','Easter_Monday','Samhain','Halloween'
                       ]}
           ]

#Test data definition
This is the test data we will feed to the machine later to see how well it has learned and whether it can now classify articles correctly.

In [17]:
#Now, build a testset of data and possible categories to test the classifier against...
CATEGORIES = [ 'person','city','animal','movie','study','holiday']
TESTDATA = ['Abraham_Lincoln','Danny_Devito','Rheumatology','Oarfish','Lobster','United_Passions','Air_Heads',
           'Benjamin_Franklin','Grace_Kelly','Blue_Whale','Grey_Seal','Glenn_Howerton','Charlie_Day','The_Last_Airbender',
           'Susan_Sarandon','Joanna_Kerns','Snowy_Owl','Great_White_Shark','Mako_Shark','Mahi_Mahi']


#Writing the code
We use Python's NLTK (Natural Language Toolkit) and wikipedia modules for this exercise, along with the use of 'random' later on in the cross_validation functions. These modules are all we need to get our machine learning classifier working.  


In [3]:
#!/usr/bin/python3
import wikipedia
import nltk
from nltk.classify import NaiveBayesClassifier

###Trainer_length function
This only prints the length of each training list and is there for our own convenience when building the training set.  Overfitting can become a real issue, and knowing the length of each list can be useful at these times.

In [4]:
def trainer_length(trainer):
    for t in trainer:
        for key, vals in t.items():
            print(key+'\t'+str(len(vals)))
    return
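
Raw counts are easier to compare as shares of the whole training set, since a naive Bayes classifier estimates each category's prior from how often that label appears. The following is a minimal sketch, assuming the same list-of-dicts shape as TRAINERS; `category_shares` and the `sample` data are hypothetical names used only for illustration.

```python
def category_shares(trainers):
    """Return {category: fraction of all training titles} for a TRAINERS-style list."""
    counts = {}
    for group in trainers:
        for category, titles in group.items():
            counts[category] = counts.get(category, 0) + len(titles)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Hypothetical miniature training set, same shape as TRAINERS.
sample = [{'person': ['Bill_Clinton', 'Morgan_Freeman', 'Albert_Einstein']},
          {'holiday': ['Christmas']}]

shares = category_shares(sample)
for category, share in shares.items():
    print('{0}\t{1:.0%}'.format(category, share))
```

Calling `category_shares(TRAINERS)` on the real data would show at a glance which categories dominate the set.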

###Feature selection and labeling parts of speech
In order to optimize our classifier, we need to select 'features' of sentences for our classifier to learn from.  These, in Python, take the form of dictionaries a la { 'feature' : 'word' }, with a category label attached for training sessions, e.g. ({'PRP$':'Their'},'person'), or left off, e.g. {'PRP$':'Their'}, for testing/validation sessions.  The <i>label_set_pos</i> function below tags each word in an article summary with its part of speech (POS) tag and returns the tag/word pairs as a dictionary.  The dictionary is then labeled with its category for future classification and appended to labeled_words.

In [6]:
def label_set_pos(TRAINERS):
    labeled_words=[]
    for trainingSet in TRAINERS:
        for key,vals in trainingSet.items():
            for linkid in vals:
                #PAGECONTENT=wikipedia.page(linkid).content
                PAGECONTENT=wikipedia.summary(linkid)
                PAGECONTENT=nltk.word_tokenize(PAGECONTENT)
                POSTAG=nltk.pos_tag(PAGECONTENT)
                labeled_words+=[(dict((b,a) for (a,b) in POSTAG),key)]
    return labeled_words
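
To see what the `dict((b,a) for (a,b) in POSTAG)` expression produces, here is a small sketch on a hand-written list of (word, tag) pairs in the shape `nltk.pos_tag` returns. The sentence and its tags are made up for illustration, so nothing needs to be downloaded. Because dictionary keys are unique, each tag keeps only the last word seen with it.

```python
# Hand-written (word, tag) pairs in the shape nltk.pos_tag returns.
# The tags are illustrative, not real tagger output.
tagged = [('He', 'PRP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'),
          ('Hope', 'NNP'), ('and', 'CC'), ('he', 'PRP'), ('was', 'VBD'),
          ('raised', 'VBN'), ('in', 'IN'), ('Arkansas', 'NNP')]

# Same expression as in label_set_pos: the tag becomes the key, the word the value.
features = dict((tag, word) for (word, tag) in tagged)

# Training form is (features, label); for testing, the label is simply left off.
labeled = (features, 'person')

print(features)
```

Eleven tokens collapse to six features here, so each summary contributes at most one word per tag. A summary that contains no word with a given tag has no entry for that tag at all, which is most likely why values such as `CD = None` show up among the most informative features.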

###Classifying unknowns
This function is fairly similar to the one above.  However, it does not label the part-of-speech dictionary, since it does not know what category the article belongs to; instead it builds an unlabeled dictionary of the words in the unclassified article summary.  These words are tagged the same way, using NLTK, and the resulting dictionary is then run through the classifier's classify method, which returns a string with the classifier's best guess as to which category the article belongs to.

In [8]:
def classify_unknowns(classifier,TESTDATA):
    for article in TESTDATA:
        #CONTENT=wikipedia.page(article).content
        CONTENT=wikipedia.summary(article)
        CONTENT=nltk.word_tokenize(CONTENT)
        CONTENT=nltk.pos_tag(CONTENT)
        #Unlabeled feature dictionary: POS tag -> word
        featureset=dict((b,a) for (a,b) in CONTENT)
        print("Name: {0:10}  ".format(article),
                  "My Guess: {0:10}".format(classifier.classify(featureset)))
    return
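
Conceptually, a naive Bayes classifier scores each label by its prior combined with the per-feature likelihoods, then returns the highest-scoring label. The sketch below is a simplified, pure-Python illustration of that idea on feature dictionaries of the same shape. It is not NLTK's implementation: `train_toy_nb`, `classify_toy_nb`, the smoothing constant, and the sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import math

def train_toy_nb(labeled_featuresets):
    """Count labels and, per (label, feature name), the values seen."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts

def classify_toy_nb(model, features, smoothing=0.5):
    """Return the label with the highest log prior plus summed log likelihoods."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            seen = value_counts[(label, name)]
            bins = len(seen) + 1  # one extra bin for values never seen
            # Toy limitation: a feature name never seen with this label adds log(1) = 0.
            score += math.log((seen[value] + smoothing) /
                              (sum(seen.values()) + smoothing * bins))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled feature dictionaries (POS tag -> word).
training = [({'PRP$': 'his', 'VBD': 'was'}, 'person'),
            ({'PRP$': 'his', 'VBD': 'was'}, 'person'),
            ({'PRP$': 'its', 'VBP': 'are'}, 'animal'),
            ({'PRP$': 'its', 'VBP': 'are'}, 'animal')]

model = train_toy_nb(training)
print(classify_toy_nb(model, {'PRP$': 'his'}))
print(classify_toy_nb(model, {'PRP$': 'its'}))
```

The real `classifier.classify(featureset)` call plays the same role as `classify_toy_nb` here: it takes an unlabeled feature dictionary and returns a single label string.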
    

#Start of actual work


In [None]:
#Load old classifier if it exists
import pickle
ifp = open('myclassifier.pickle','rb')  #open for reading in binary mode
classifier=pickle.load(ifp)
ifp.close()

In [10]:
#Shows the length of each list in the training data set we train on
trainer_length(TRAINERS)

person	37
city	36
movie	27
animal	44
study	29
holiday	18


In [15]:
stopwords=nltk.corpus.stopwords.words('english')
labeled_words=[]
labeled_words=label_set_pos(TRAINERS)
classifier=NaiveBayesClassifier.train(labeled_words)
print(classifier)
import pickle
ofp = open('myclassifier.pickle','wb')
pickle.dump(classifier,ofp)
ofp.close()
print('new classifier saved!')

<nltk.classify.naivebayes.NaiveBayesClassifier object at 0x7f7697de3828>
new classifier saved!


Saving the classifier allows us to load it later and spares us the tedious wait for the training to finish.
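
The save/load round trip can be sketched on its own with a stand-in object, so it runs without retraining anything. The `model` dictionary and the temporary directory are placeholders chosen for illustration; a trained classifier would be pickled the same way.

```python
import os
import pickle
import tempfile

# Stand-in for the trained classifier; any picklable object behaves the same way.
model = {'labels': ['person', 'city'], 'note': 'placeholder, not a real classifier'}

# Write to a throwaway directory so the example leaves the working directory alone.
path = os.path.join(tempfile.mkdtemp(), 'myclassifier.pickle')

# Save: binary write mode. The with-block closes the file even if dump fails.
with open(path, 'wb') as ofp:
    pickle.dump(model, ofp)

# Load: binary read mode. Opening with 'wb' here would truncate the file instead.
with open(path, 'rb') as ifp:
    restored = pickle.load(ifp)

print(restored == model)
```

Using `with` instead of explicit `close()` calls is a design choice rather than a requirement; it just guarantees the file handle is released.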

In [16]:
best_features=classifier.most_informative_features(25)
#show_most_informative_features prints the table itself and returns None
classifier.show_most_informative_features(15)

Most Informative Features
                     PRP = 'It'            movie : person =     13.2 : 1.0
                    PRP$ = 'his'          person : city   =     12.0 : 1.0
                      CD = None            study : city   =     10.6 : 1.0
                     JJS = None            study : city   =     10.1 : 1.0
                     VBD = 'was'            city : study  =      9.4 : 1.0
                     VBP = None           person : animal =      8.7 : 1.0
                     VBP = 'are'          animal : person =      8.4 : 1.0
                     POS = None            study : city   =      8.3 : 1.0
                      CC = 'or'           holida : city   =      8.1 : 1.0
                     WDT = 'that'          study : city   =      7.8 : 1.0
                      MD = 'can'           study : city   =      7.6 : 1.0
                     VBD = None           animal : city   =      7.6 : 1.0
                    PRP$ = 'its'            city : person =      7.6 : 1.0

In [18]:
classify_unknowns(classifier,TESTDATA)

Name: Abraham_Lincoln   My Guess: person    
Name: Danny_Devito   My Guess: person    
Name: Rheumatology   My Guess: study     
Name: Oarfish      My Guess: animal    
Name: Lobster      My Guess: study     
Name: United_Passions   My Guess: movie     
Name: Air_Heads    My Guess: movie     
Name: Benjamin_Franklin   My Guess: city      
Name: Grace_Kelly   My Guess: person    
Name: Blue_Whale   My Guess: animal    
Name: Grey_Seal    My Guess: animal    
Name: Glenn_Howerton   My Guess: person    
Name: Charlie_Day   My Guess: person    
Name: The_Last_Airbender   My Guess: movie     
Name: Susan_Sarandon   My Guess: person    
Name: Joanna_Kerns   My Guess: person    
Name: Snowy_Owl    My Guess: animal    
Name: Great_White_Shark   My Guess: animal    
Name: Mako_Shark   My Guess: animal    
Name: Mahi_Mahi    My Guess: animal    


####In this set of 20 tests, the classifier guesses 18, or exactly 90%, correctly!
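
The tally can be checked with a few lines of Python. The guesses below are copied from the output above; the expected labels are assigned by hand for this sketch (an assumption, since the notebook does not list them), and `rows`, `missed`, and `accuracy` are illustrative names.

```python
# (article, classifier's guess from the output above, hand-assigned expected label)
rows = [
    ('Abraham_Lincoln', 'person', 'person'), ('Danny_Devito', 'person', 'person'),
    ('Rheumatology', 'study', 'study'), ('Oarfish', 'animal', 'animal'),
    ('Lobster', 'study', 'animal'), ('United_Passions', 'movie', 'movie'),
    ('Air_Heads', 'movie', 'movie'), ('Benjamin_Franklin', 'city', 'person'),
    ('Grace_Kelly', 'person', 'person'), ('Blue_Whale', 'animal', 'animal'),
    ('Grey_Seal', 'animal', 'animal'), ('Glenn_Howerton', 'person', 'person'),
    ('Charlie_Day', 'person', 'person'), ('The_Last_Airbender', 'movie', 'movie'),
    ('Susan_Sarandon', 'person', 'person'), ('Joanna_Kerns', 'person', 'person'),
    ('Snowy_Owl', 'animal', 'animal'), ('Great_White_Shark', 'animal', 'animal'),
    ('Mako_Shark', 'animal', 'animal'), ('Mahi_Mahi', 'animal', 'animal'),
]

# Articles where the guess and the expected label disagree.
missed = [name for (name, guess, expected) in rows if guess != expected]
accuracy = (len(rows) - len(missed)) / len(rows)

print('{0} of {1} correct ({2:.0%})'.format(len(rows) - len(missed), len(rows), accuracy))
print('Missed:', ', '.join(missed))
```

Note that Blue_Whale and Susan_Sarandon also appear in TRAINERS, so only 18 of the 20 test articles are truly unseen; the classifier gets 16 of those 18 right, or roughly 89%.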