Mestrado em Modelagem Matematica da Informacao
----------------------------------------------
Disciplina: Modelagem e Mineracao de Dados
------------------------------------------

Master Program - Mathematical Modeling of Information
-----------------------------------------------------
Course: Data Mining and Modeling
--------------------------------

Professor: Renato Rocha Souza
-----------------------------

### Topic: Classification and Filtering 

Information on the Python Packages used:  
http://docs.python.org/library/sqlite3.html  
http://docs.python.org/library/re.html  
http://www.feedparser.org/  
http://docs.python.org/2/library/tkinter.html  
http://www.tkdocs.com/tutorial/index.html  

In [55]:
import os
import sys
import re
import math
import feedparser
import time
import string
import datetime
import sqlite3
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

Specifying the path to the files:

In [61]:
outputs = "/home/rsouza/Dropbox/Renato/ModMinDados/outputs/"

dbfile1 = "treino.sqlite"
dbfile2 = "treinoblogs.sqlite"
outblog = "blogoutputrss.xml"

db_teste = os.path.join(outputs,dbfile1)
db_blog = os.path.join(outputs,dbfile2)
rssblogoutput = os.path.join(outputs,outblog)

stopwords = nltk.corpus.stopwords.words('english') + nltk.corpus.stopwords.words('portuguese')

#### First block of functions: feature extraction:

In [62]:
def getwords(html):
    '''Removes the HTML tags from the feed text, splits it on whitespace,
    strips the surrounding punctuation and converts all words to lowercase.
    Ignores stopwords and words shorter than 3 or longer than 19 characters'''
    # get_text() joins every text node, not only the first one
    text = BeautifulSoup(html, 'html.parser').get_text()
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if len(w)>2 and len(w)<20 and w not in stopwords]
    return dict([(w,1) for w in words])

In [65]:
def entryfeatures(entry):
    '''Used when the items to classify are RSS feed entries, not plain documents'''
    # '\W+' (not '\W*') so that the split never happens on empty matches
    splitter=re.compile(r'\W+', flags=re.U)
    f={}
    # Extract title words
    titlewords=[s.lower() for s in splitter.split(entry['title']) if len(s)>2 and len(s)<20 and s.lower() not in stopwords]
    for w in titlewords: f['Title:'+w]=1
    # Extract summary words, keeping the original case to count uppercase words
    summarywords=[s for s in splitter.split(entry['summary']) if len(s)>2 and len(s)<20 and s.lower() not in stopwords]
    # Count uppercase words
    uc=0
    for i in range(len(summarywords)):
        w=summarywords[i]
        # Features are the lowercased words in the feed summary
        f[w.lower()]=1
        if w.isupper(): uc+=1
        # Pairs of consecutive words are also features
        if i < len(summarywords)-1:
            twowords=' '.join(summarywords[i:i+2]).lower()
            f[twowords]=1
    # Save publisher information
    f['Publisher:'+entry['publisher']]=1
    # Flag entries in which more than 30% of the summary words are uppercase
    if summarywords and float(uc)/len(summarywords)>0.3:
        f['MAIUSCULAS']=1
    return f
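
The summary loop in `entryfeatures` is meant to record both single words and pairs of consecutive words as features. The sketch below isolates that idea using only the standard library. The function name `word_and_pair_features` and its `minlen`/`maxlen` parameters are illustrative assumptions, not part of the notebook, and stopword filtering is left out for brevity.

```python
import re

def word_and_pair_features(text, minlen=3, maxlen=19):
    """Return a dict of single-word and consecutive word-pair features.

    Hypothetical helper for illustration; mirrors the length limits used above.
    """
    # Split on runs of non-word characters and drop too-short/too-long tokens
    words = [w.lower() for w in re.split(r'\W+', text) if minlen <= len(w) <= maxlen]
    features = {}
    for i, w in enumerate(words):
        features[w] = 1
        if i < len(words) - 1:
            # A slice of length two: the current word and the next one
            features[' '.join(words[i:i + 2])] = 1
    return features

feats = word_and_pair_features('Quick brown fox jumps')
print(sorted(feats))
```

Note that the slice must span two positions (`i:i + 2`); a slice such as `i:i + 1` would only repeat the single word.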

#### Second block of functions: classification

In [30]:
class classifier:
    ''' Represents the classifier, storing what's learnt from training.
    fc saves the combination words/categories {'word': {'bad': 3, 'good': 2}}
    and cc is a dictionary that stores the number of times a category was used
    {'bad': 3, 'good': 2}. Will be used when no DB is used.
    Getfeatures is the feature extraction function to be used'''
    def __init__(self, getfeatures, filename=None, usedb=False):
        self.fc={}
        self.cc={}
        self.getfeatures = getfeatures
        self.usedb = usedb
    
    def setdb(self,dbfile):
        '''When using a database and not dictionaries, to persist the information
        across sessions'''
        self.con = sqlite3.dbapi2.connect(dbfile)    
        self.con.execute(u'create table if not exists fc(feature,category,count)')
        self.con.execute(u'create table if not exists cc(category,count)')

    def fcount(self,f,cat):
        '''Returns the number of times a feature appears in a category'''
        if not self.usedb:
            if f in self.fc and cat in self.fc[f]: 
                return float(self.fc[f][cat])
            else: 
                return 0
        else:
            # Placeholders keep features that contain quotes from breaking the query
            query = u'select count from fc where feature=? and category=?'
            res = self.con.execute(query, (f,cat)).fetchone()
            if res is None: 
                return 0
            else: 
                return float(res[0])

    def incf(self,f,cat):
        '''Creates a feature/category pair with count 1 if it does not exist,
        or increments its count if it does'''
        if not self.usedb:
            self.fc.setdefault(f,{})
            self.fc[f].setdefault(cat,0)
            self.fc[f][cat] += 1
        else:
            count=self.fcount(f,cat)
            if count == 0:
                self.con.execute(u'insert into fc values (?,?,1)', (f,cat))
            else:
                query = u'update fc set count=? where feature=? and category=?'
                self.con.execute(query, (count+1,f,cat))

    def incc(self,cat):
        '''Increases the number of occurrences of a category'''
        if not self.usedb:
            self.cc.setdefault(cat,0)
            self.cc[cat] += 1        
        else:
            count=self.catcount(cat)
            if count == 0:
                self.con.execute(u'insert into cc values (?,1)', (cat,))
            else:
                query = u'update cc set count=? where category=?'
                self.con.execute(query, (count+1,cat))

    def catcount(self,cat):
        '''Counts the number of items in a category'''
        if not self.usedb:
            if cat in self.cc:
                return float(self.cc[cat])
            else:
                return 0
        else:
            query = u'select count from cc where category=?'
            res=self.con.execute(query, (cat,)).fetchone()
            if res is None:
                return 0
            else:
                return float(res[0])

    def categories(self):
        '''Lists all the categories'''        
        if not self.usedb: return self.cc.keys()
        else:
            cur=self.con.execute(u'select category from cc')
            return [d[0] for d in cur]

    def totalcount(self):
        '''Returns the total number of items'''
        if not self.usedb: return sum(self.cc.values())
        else:
            res=self.con.execute(u'select sum(count) from cc').fetchone()
            # sum() over an empty table yields a single NULL value
            if res is None or res[0] is None: return 0
            else: return res[0]

    def train(self,item,cat):
        '''Receives an item and a category, and increments the count of each
        feature of the item in that category, as well as the category count'''
        features=self.getfeatures(item)
        for f in features:
            self.incf(f,cat)
        self.incc(cat)
        if self.usedb: self.con.commit()

    def fprob(self,f,cat):
        '''Calculates the probability of a feature to be within a category'''
        if self.catcount(cat)== 0:
            return 0
        return self.fcount(f,cat)/float(self.catcount(cat))

    def weightedprob(self,f,cat,prf,weight=1.0,ap=0.5):
        '''Calculates the probability of a feature appearing in a category,
        as given by prf, but starts from an assumed probability (ap) and moves
        away from it as training data accumulates. This reduces the chance of
        a rare word being misclassified on little evidence'''
        basicprob=prf(f,cat)
        totals=sum([self.fcount(f,c) for c in self.categories()])
        bp=((weight*ap)+(totals*basicprob))/(weight+totals)
        return bp

In [31]:
class naivebayes(classifier):
    '''Extends classifier class overriding __init__ and adding specific functions
    to classify documents using naive bayes'''
    
    def __init__(self, getfeatures, usedb=False):
        classifier.__init__(self,getfeatures)
        self.thresholds = {}
        self.usedb = usedb
        
    def docprob(self,item,cat):
        '''Calculates the probability of a document given a category by
        multiplying the probabilities of all its features in that category'''
        features=self.getfeatures(item)
        p=1
        for f in features: 
            p*=self.weightedprob(f,cat,self.fprob)
        return p

    def prob(self,item,cat):
        catprob=self.catcount(cat)/float(self.totalcount())
        docprob=self.docprob(item,cat)
        return docprob*catprob

    def setthreshold(self,cat,t):
        self.thresholds[cat]=t

    def getthreshold(self,cat):
        if cat not in self.thresholds: 
            return 1.0
        return self.thresholds[cat]

    def classify(self, item, default=None):
        '''Finds the most probable category and returns it, provided that its
        probability exceeds that of every other category multiplied by the
        threshold of the best category; otherwise returns "default"'''
        probs = {}
        maximum = 0.0
        best = None
        for cat in self.categories():
            probs[cat] = self.prob(item, cat)
            if probs[cat] > maximum: 
                maximum = probs[cat]
                best = cat
        # No category with a nonzero probability (e.g. an untrained classifier)
        if best is None:
            return default
        for cat in probs:
            if cat == best:
                continue
            if probs[cat]*self.getthreshold(best) > probs[best]: 
                return default
        return best

In [32]:
class fisherclassifier(classifier):
    '''Extends classifier class overriding __init__ and adding specific functions
    to classify documents using fisher method'''

    def __init__(self,getfeatures, usedb=False):
        classifier.__init__(self, getfeatures)
        self.minimums = {}
        self.usedb = usedb
        
    def cprob(self,f,cat):
        '''Returns the frequency of the feature in a category divided
        by the frequency in all categories'''
        clf=self.fprob(f,cat)
        if clf == 0:
            return 0
        freqsum=sum([self.fprob(f,c) for c in self.categories()])
        p=float(clf)/(freqsum)
        return p

    def invchi2(self,chi, df):
        m = chi / 2.0
        sum = term = math.exp(-m)
        for i in range(1, df//2):
            term *= m / i
            sum += term
        return min(sum, 1.0)

    def prob(self,item,cat):
        '''Multiplies the probabilities of all the features, takes the natural
        log and uses the inverse chi2 function to calculate the probability'''
        p = 1
        features = self.getfeatures(item)
        for f in features:
            p *= (self.weightedprob(f,cat,self.cprob))
        fscore = -2*math.log(p)
        return self.invchi2(fscore,len(features)*2)

    def setminimum(self,cat, minimum):
        self.minimums[cat] = minimum

    def getminimum(self,cat):
        if cat not in self.minimums: return 0
        return self.minimums[cat]

    def classify(self, item, default=None):
        '''Applies fisher to all categories to find the best result, given 
        that it satisfies a minimum threshold, otherwise sets to "None"'''
        best = default
        maximum = 0.0
        for c in self.categories():
            p = self.prob(item,c)
            if p>self.getminimum(c) and p > maximum:
                best = c
                maximum = p
        return best
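
The `invchi2` method sums the series for the upper-tail probability of a chi-squared variable with an even number of degrees of freedom. The sketch below copies it as a free function (renaming the local `sum` to `total` so the built-in is not shadowed) and compares it with the closed forms for 2 and 4 degrees of freedom. It is only a sanity check under those assumptions, not part of the classifier.

```python
import math

def invchi2(chi, df):
    """Upper-tail chi-squared probability for an even number of degrees of freedom."""
    m = chi / 2.0
    total = term = math.exp(-m)
    # Adds the terms m**i / i! for i = 1 .. df/2 - 1, each scaled by exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

# With df = 2 the series has a single term: exp(-chi/2)
print(invchi2(2.0, 2), math.exp(-1.0))
# With df = 4 it is exp(-m) * (1 + m), where m = chi/2
print(invchi2(3.0, 4), math.exp(-1.5) * (1 + 1.5))
```

In `prob`, `df` is twice the number of features, so it is always even and the integer division loses nothing.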

#### Third block of functions: reading files or searching for feeds

In [None]:
def blogread(file_or_subject, classifier, search=True):
    '''Receives a subject to search Google for blog feeds, or (with
    search=False) an RSS XML file with saved feeds. Shows each entry, suggests
    a category and trains the classifier with the category chosen by the user'''
    if search:
        generic = 'http://www.google.com/search?q={}&hl=pt-BR&tbm=blg&output=rss'
        url = generic.format(file_or_subject)
        f = feedparser.parse(url)
    else:
        f = feedparser.parse(file_or_subject)
    for entry in f['entries']:
        print(u'\n-----')
        print(u'Title:     '+ entry['title'])
        print(u'Publisher: '+ entry['publisher'])
        print(u'Date Published: ',datetime.datetime.fromtimestamp(time.mktime(entry['updated_parsed'])))
        print(u'\n-----')        
        print(entry['summary'])
        guess = classifier.classify(entry)
        print(u'Suggested: {}'.format(guess))
        cl = input('Enter category or press <enter> to accept suggestion: ').lower() #does not work in Ipython Notebook
        txt = 'Title:     '+entry['title']
        txt = txt+u'\n'+u'Publisher: '+entry['publisher']
        txt = txt+u'\n'+entry['summary']
        txt = txt+u'\n'+u'Suggested: {}'.format(guess)
        if cl.strip() == '' and guess:
            cl = guess
        print(u'Category "{}" chosen'.format(cl))
        classifier.train(entry,cl)

#### Fourth block of functions: instantiating and training classifiers

In [34]:
def sampletrain(cl):
    print('Running sampletrain to train the classifier...')
    cl.train('Nobody owns the water.','good')
    cl.train('the quick rabbit jumps fences','good')
    cl.train('buy pharmaceuticals now','bad')
    cl.train('make quick money at the online casino','bad')
    cl.train('the quick brown fox jumps','good')

In [35]:
def probabilidades_palavras():
    cl = classifier(getwords)
    print('\n')    
    sampletrain(cl)
    
    print('How many times "quick" --> "good": {}'.format(cl.fcount('quick','good')))
    print('How many times "quick" --> "bad": {}'.format(cl.fcount('quick','bad')))
    print('\nProbability of "quick" given that "good": {}'.format(cl.fprob('quick','good')))
    print('Probability of "money" given that "good" (fprob): {}'.format(cl.fprob('money','good')))
    print('Weighted probability of "money" given that "good" (weightedprob): {}'.format(cl.weightedprob('money','good',cl.fprob)))

    print('\nTraining again with the same documents...\n')
    sampletrain(cl)

    print('\nProbability of "money" given that "good" (fprob): {}'.format(cl.fprob('money','good')))
    print('Weighted probability of "money" given that "good" (weightedprob): {}\n'.format(cl.weightedprob('money','good',cl.fprob)))

In [36]:
probabilidades_palavras()



Running sampletrain to train the classifier...
How many times "quick" --> "good": 2.0
How many times "quick" --> "bad": 1.0

Probability of "quick" given that "good": 0.666666666667
Probability of "money" given that "good" (fprob): 0.0
Weighted probability of "money" given that "good" (weightedprob): 0.25

Training again with the same documents...

Running sampletrain to train the classifier...

Probability of "money" given that "good" (fprob): 0.0
Weighted probability of "money" given that "good" (weightedprob): 0.166666666667
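
The two weighted probabilities above can be reproduced by hand from the formula in `weightedprob`. The sketch below uses a standalone function (`weighted_prob` is an illustrative name, not the class method) with the same defaults, `weight=1.0` and `ap=0.5`.

```python
def weighted_prob(basicprob, totals, weight=1.0, ap=0.5):
    """Blend the assumed probability `ap` with the observed `basicprob`.

    `totals` is how many times the feature was seen across all categories.
    """
    return (weight * ap + totals * basicprob) / (weight + totals)

# "money" never appears in "good" (fprob = 0) and was seen once overall
first = weighted_prob(0.0, 1)
print(first)   # 0.25

# After training on the same documents again it has been seen twice overall
second = weighted_prob(0.0, 2)
print(second)
```

The more often a word is seen without ever appearing in "good", the further its weighted probability drops from the assumed 0.5 towards the observed 0.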



In [37]:
def probabilidades_documentos_bayes():
    cl = naivebayes(getwords)
    print('\n')    
    sampletrain(cl)
    
    print('Classifying "quick rabbit": {}'.format(cl.classify('quick rabbit', default='unknown')))
    print('Classifying "quick money": {}'.format(cl.classify('quick money', default='unknown')))
    
    print('\nSetting the threshold up...')
    cl.setthreshold('bad',3.0)

    print('Classifying "quick money": {}'.format(cl.classify('quick money', default='unknown')))
    
    print('\nTraining again with the same documents (10x)...')
    for i in range(10): sampletrain(cl)
    
    print('\nClassifying "quick money": {}'.format(cl.classify('quick money', default='unknown')))

In [38]:
probabilidades_documentos_bayes()



Running sampletrain to train the classifier...
Classifying "quick rabbit": good
Classifying "quick money": bad

Setting the threshold up...
Classifying "quick money": unknown

Training again with the same documents (10x)...
Running sampletrain to train the classifier...
Running sampletrain to train the classifier...
Running sampletrain to train the classifier...
Running sampletrain to train the classifier...
Running sampletrain to train the classifier...
Running sampletrain to train the classifier...
Running sampletrain to train the classifier...
Running sampletrain to train the classifier...
Running sampletrain to train the classifier...
Running sampletrain to train the classifier...

Classifying "quick money": bad
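
To see why raising the threshold for "bad" turns the first answer into "unknown", the sketch below redoes the naive Bayes arithmetic for "quick money" after a single pass of `sampletrain`. The counts are hard-coded from the five sample documents, and the function names are illustrative stand-ins rather than the class methods.

```python
# Counts after one pass of sampletrain (only the two words we need)
fc = {'quick': {'good': 2, 'bad': 1}, 'money': {'bad': 1}}
cc = {'good': 3, 'bad': 2}

def weighted(f, cat, weight=1.0, ap=0.5):
    basic = fc[f].get(cat, 0) / cc[cat]      # fprob
    totals = sum(fc[f].values())             # occurrences in all categories
    return (weight * ap + totals * basic) / (weight + totals)

def prob(words, cat):
    p = cc[cat] / sum(cc.values())           # prior of the category
    for w in words:
        p *= weighted(w, cat)
    return p

def classify(words, thresholds, default='unknown'):
    probs = {cat: prob(words, cat) for cat in cc}
    best = max(probs, key=probs.get)
    for cat in probs:
        # Every other category, scaled by the threshold, must stay below the best
        if cat != best and probs[cat] * thresholds.get(best, 1.0) > probs[best]:
            return default
    return best

doc = ['quick', 'money']
print(prob(doc, 'good'), prob(doc, 'bad'))
print(classify(doc, {}))
print(classify(doc, {'bad': 3.0}))
```

The two scores are close (roughly 0.094 for "good" against 0.1 for "bad"), so "bad" wins with the default threshold of 1.0 but not when it has to be three times more probable than "good".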


In [39]:
def probabilidades_palavras_fisher():
    cl = fisherclassifier(getwords)
    print('\n')    
    sampletrain(cl)
    print('\n')      
    print('Probability of "quick" given that "good": {}'.format(cl.cprob('quick', 'good')))
    print('Probability of "money" given that "bad": {}'.format(cl.cprob('money', 'bad')))
    print('Weighted probability of  "money" given that "bad": {}'.format(cl.weightedprob('money','bad',cl.cprob)))

In [40]:
probabilidades_palavras_fisher()



Running sampletrain to train the classifier...


Probability of "quick" given that "good": 0.571428571429
Probability of "money" given that "bad": 1.0
Weighted probability of  "money" given that "bad": 0.75


In [41]:
def probabilidades_documentos_fisher():
    cl = fisherclassifier(getwords)
    print('\n')    
    sampletrain(cl)

    print('Classifying "quick rabbit": {}'.format(cl.classify('quick rabbit')))
    print('Classifying "quick money": {}'.format(cl.classify('quick money')))
   
    print('\nSetting the threshold up...')
    cl.setminimum('bad',0.8)
    print('Classifying "quick money": {}'.format(cl.classify('quick money')))

    print('\nSetting the threshold down...')
    cl.setminimum('bad',0.4)
    print('Classifying "quick money": {}'.format(cl.classify('quick money')))

In [42]:
probabilidades_documentos_fisher()



Running sampletrain to train the classifier...
Classifying "quick rabbit": good
Classifying "quick money": bad

Setting the threshold up...
Classifying "quick money": good

Setting the threshold down...
Classifying "quick money": bad
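
The effect of the minimums can be checked against the Fisher scores themselves. The sketch below recomputes them for "quick money" from hard-coded counts (one pass of `sampletrain`). All names are illustrative stand-ins for the class methods, under the assumption that only the words "quick" and "money" matter for this document.

```python
import math

fc = {'quick': {'good': 2, 'bad': 1}, 'money': {'bad': 1}}
cc = {'good': 3, 'bad': 2}

def fprob(f, cat):
    return fc[f].get(cat, 0) / cc[cat]

def cprob(f, cat):
    # Frequency in this category divided by the sum over all categories
    total = sum(fprob(f, c) for c in cc)
    return fprob(f, cat) / total if total else 0.0

def weighted(f, cat, weight=1.0, ap=0.5):
    totals = sum(fc[f].values())
    return (weight * ap + totals * cprob(f, cat)) / (weight + totals)

def invchi2(chi, df):
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def fisherprob(words, cat):
    p = 1.0
    for w in words:
        p *= weighted(w, cat)
    return invchi2(-2 * math.log(p), 2 * len(words))

def classify(words, minimums, default=None):
    best, maximum = default, 0.0
    for cat in cc:
        p = fisherprob(words, cat)
        if p > minimums.get(cat, 0) and p > maximum:
            best, maximum = cat, p
    return best

doc = ['quick', 'money']
print(fisherprob(doc, 'good'), fisherprob(doc, 'bad'))
for minimums in ({}, {'bad': 0.8}, {'bad': 0.4}):
    print(minimums, classify(doc, minimums))
```

The score for "bad" is about 0.70 and the score for "good" about 0.41, so a minimum of 0.8 rules "bad" out and leaves "good" as the only candidate, while a minimum of 0.4 lets "bad" win again.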


In [43]:
def using_db_example():
    '''Trains one classifier, persisting the training data in a database,
    then uses the same data to classify with another classifier'''
    print('\nInstantiating a fisher classifier...')
    cl = fisherclassifier(getwords, usedb=True)
    cl.setdb(db_teste)
    sampletrain(cl)
    print('\nInstantiating a naive bayes classifier...')    
    cl2 = naivebayes(getwords, usedb=True)
    cl2.setdb(db_teste)
    print('Classifying "quick money": {}'.format(cl2.classify('quick money')))

In [44]:
using_db_example()


Instantiating a fisher classifier...
Running sampletrain to train the classifier...

Instantiating a naive bayes classifier...
Classifying "quick money": bad
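
The second classifier can answer without any training of its own because the counts live in the SQLite file, not in the object. A minimal sketch of that persistence, assuming a throwaway database in a temporary directory (the file name `demo.sqlite` and the helper `increment` are hypothetical) and using `?` placeholders so that features containing quotes are stored safely:

```python
import os
import sqlite3
import tempfile

def increment(con, feature, category):
    """Add one to the stored count of a (feature, category) pair."""
    row = con.execute('select count from fc where feature=? and category=?',
                      (feature, category)).fetchone()
    if row is None:
        con.execute('insert into fc values (?,?,1)', (feature, category))
    else:
        con.execute('update fc set count=? where feature=? and category=?',
                    (row[0] + 1, feature, category))
    con.commit()

path = os.path.join(tempfile.mkdtemp(), 'demo.sqlite')  # hypothetical file name

# First connection: create the table and "train" twice
con1 = sqlite3.connect(path)
con1.execute('create table if not exists fc(feature,category,count)')
increment(con1, 'say "hi"', 'good')   # embedded quotes are handled by the placeholders
increment(con1, 'say "hi"', 'good')
con1.close()

# Second connection: the counts written by the first one are still there
con2 = sqlite3.connect(path)
count = con2.execute('select count from fc where feature=? and category=?',
                     ('say "hi"', 'good')).fetchone()[0]
con2.close()
print(count)   # 2
```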


In [45]:
def classifying_blogs(subject='python'):
    '''Instantiates a new classifier using "entryfeatures" (for feeds),
    creates the database that persists the training data and calls
    blogread with the feed search option (no file reading)'''
    cl = fisherclassifier(entryfeatures, usedb=True)
    cl.setdb(db_blog)
    if not subject:
        subject = input(u'\n\nPlease enter a subject to search for feeds: ').lower()
    blogread(subject, cl)    
    print('\nList of categories stored in the database:')
    for category in cl.categories(): 
        print(category)
    return cl

In [69]:
cl = classifying_blogs('moluscos')


-----
Title:     Áreas de cultivo de <b>moluscos</b> em SC voltam a ser interditadas - Geral
Publisher: Últimas notícias de Joinville e região - A Notícia
(u'Date Published: ', datetime.datetime(2014, 10, 10, 22, 21, 33))

-----
A toxina diarreica (DSP) voltou a ser encontrada em <em>moluscos</em> bivalves de quatro localidades do litoral catarinense. Nesta quinta-feira, a Secretaria de Estado da Agricultura e da Pesca interditou Zimbros e Canto Grande, em&nbsp;...
Suggested: corrupcao
Enter category or press <enter> to accept suggestion: pesca
Category "pesca" chosen

-----
Title:     Conheça diferentes grupos de <b>moluscos</b> | Gente que Educa
Publisher: Nova Escola - Planos de Aula - Ensino Fundamental
(u'Date Published: ', datetime.datetime(2010, 7, 26, 17, 31))

-----
Divida os animais em três bandejas diferentes e identifique cada bandeja com o nome daquele grupo de animais. Nesse momento não explique as características de cada grupo, permita que as características externas de

Do some tests now:

In [68]:
#cl.cprob(<word>,<category>)
print(cl.cprob('petrobras','corrupcao'))

#cl.fprob(<word>,<category>)
print(cl.fprob('fields','ciencia'))

0.0666666666667
0
