<h1 style='margin-bottom:20px;'>Data Science Module</h1>
<p style='margin:0;padding:0;'>Sergio de la Cruz Badillo</p>
<p style='margin:0;padding:0;'>dlcruzser12@gmail.com</p>
<p style='margin:0;padding:0;'>06/January/2020</p>

# Text Classification (Text Categorization or Document Classification)

The goal in classification is to take an input vector $x$ and assign it to one of $K$ discrete classes $C_k$, where $k=1, ..., K$.
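
As a minimal sketch of this definition, the snippet below scores a toy input vector against $K = 3$ classes with a linear rule and picks the class with the highest score. The weight vectors and the `classify` helper are illustrative assumptions; they are not part of the pipeline built later in this notebook.

```python
# Toy illustration: assign an input vector x to one of K discrete classes.
# The weight vectors are made-up values chosen only for this example.
CLASS_WEIGHTS = {
    'disminuye':  [1.0, 0.0, 0.0],
    'incrementa': [0.0, 1.0, 0.0],
    'mantiene':   [0.0, 0.0, 1.0],
}

def classify(x):
    """Return the class whose weight vector has the largest dot product with x."""
    scores = {label: sum(w_i * x_i for w_i, x_i in zip(weights, x))
              for label, weights in CLASS_WEIGHTS.items()}
    return max(scores, key=scores.get)

predicted = classify([0.1, 0.7, 0.2])
print(predicted)  # incrementa
```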

In [1]:
#!conda install -c conda-forge pdfminer.six -y
#!conda install -c conda-forge pattern -y
#!conda install -c conda-forge gensim -y

import importlib.util

# Map each importable module name to the conda-forge package that provides it
required_packages = {'pdfminer': 'pdfminer.six', 'pattern': 'pattern', 'gensim': 'gensim'}

for module_name, package_name in required_packages.items():
    if importlib.util.find_spec(module_name) is None:
        print("*******************************************************************")
        print("**** WARNING! ****** WARNING! ****** WARNING! ****** WARNING! *****")
        print("*******************************************************************", end="\n\n")
        print("Install '{0}' with conda for Python 3: "
              "'conda install -c conda-forge {1} -y'".format(module_name, package_name))

In [2]:
import gensim
import nltk
import numpy as np
import pandas as pd
import pprint
import re
import requests
import scipy.sparse as sp
import string
import sys
import urllib.parse

from bs4 import BeautifulSoup
from io import StringIO
from io import BytesIO
from io import open
from multiprocessing import Pool
from numpy.linalg import norm
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from pattern.es import tag
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from urllib.request import urlopen

In [3]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sergio.delacruz/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
class Publication:
    """
    A class used to get information about the Banxico publication

    Attributes
    ----------
    date_view: publication date
    text_view: publication text view
    label: publication status or label ('mantiene', 'disminuye' and 'incrementa')
    url: pdf publication file url
    text: pdf publication text
    normalized_text: pdf publication normalized text
    token_list: token list from normalized text

    Methods
    -------
    __init__(date_view, text_view, label, url, text, normalized_text, token_list): constructor method
    __str__(): to string method
    """
    
    def __init__(self, date_view, text_view, label, url, text, normalized_text, token_list):
        '''
        Constructor method
        '''
        self.date_view = date_view
        self.text_view = text_view
        self.label = label
        self.url = url
        self.text = text
        self.normalized_text = normalized_text
        self.token_list = token_list
        
    def __str__(self):
        '''
        To string method
        '''
        # Build and return the string so that print(publication) and str(publication) both work
        text_preview = '{0}...'.format(self.text[:24]) if self.text is not None else None
        return '\n'.join([
            'date_view: {0}'.format(self.date_view),
            'text_view: {0}'.format(self.text_view),
            'label: {0}'.format(self.label),
            'url: {0}'.format(self.url),
            'text: {0}'.format(text_preview),
            'normalized_text: {0}'.format(self.normalized_text),
            'token_list: {0}'.format(self.token_list),
        ])

## Web Scraping

Web scraping (also called screen scraping, web data extraction, or web harvesting) is a technique for extracting large amounts of data from websites and saving it to a local file or to a database, typically in tabular (spreadsheet-like) form.

Banxico URLs:
- https://www.banxico.org.mx/
- https://www.banxico.org.mx/publicaciones-y-prensa/anuncios-de-las-decisiones-de-politica-monetaria/anuncios-politica-monetaria-t.html
- https://www.banxico.org.mx/publicaciones-y-prensa/anuncios-de-las-decisiones-de-politica-monetaria/{file_name}.pdf
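
To make the idea of extracting tabular data concrete, here is a minimal sketch that pulls the cells of an HTML table into rows and writes them as CSV. It deliberately uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; the `TableRowParser` class and the sample markup are hypothetical, and the real Banxico page may be structured differently.

```python
import csv
import io
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up page fragment used only for illustration.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr><td>01/01/00</td><td>Anuncio de ejemplo A</td></tr>
    <tr><td>02/01/00</td><td>Anuncio de ejemplo B</td></tr>
  </tbody>
</table>
"""

parser = TableRowParser()
parser.feed(SAMPLE_HTML)

# Save the scraped rows in table (CSV) form; an in-memory buffer stands in for a file.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
print(buffer.getvalue())
```

The `Crawler` class in the next cell follows the same pattern (find the table, walk its rows, read each cell) but relies on BeautifulSoup's `find` helpers instead of a hand-written parser.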

In [5]:
class Crawler:
    '''
    Crawler class contains the main processes to get information from a website.
    
    Attributes
    ----------
    session: requests session reused for every HTTP request
    url_base: base URL of the Banxico website
    headers: HTTP headers (User-Agent and Accept) sent with each request

    Methods
    -------
    __init__(): constructor method
    __str__(): to string method
    get_page(url): to get BeautifulSoup with information about a website
    get_publication_list(bs): publication list from banxico
    read_PDF(file_url): to read text from a pdf file
    '''
    
    def __init__(self):
        '''
        Constructor method
        '''
        self.session = requests.Session()
        self.url_base = ('https://www.banxico.org.mx')
        self.headers = {'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) '
                                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                                       'Chrome/39.0.2171.95 Safari/537.36'),
                        'Accept': ('text/html,application/xhtml+xml,application/xml;'
                                   'q=0.9,image/webp,*/*;'
                                   'q=0.8')}
    def get_page(self, url):
        '''
        To get BeautifulSoup with information about a website.
        
        Parameters
        ----------
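        url: path of the page, relative to url_base
        '''
        try:
            req = self.session.get('{}{}'.format(self.url_base, url), headers=self.headers)
        except requests.RequestException as error:
            # Report network-level failures and let the caller handle the missing page
            print("Oops!", type(error).__name__, "occurred.")
            return None
        return BeautifulSoup(req.text, 'html.parser')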
    
    def get_publication_list(self, bs):
        '''
        Publication list from banxico.
        
        Parameters
        ----------
        bs: BeautifulSoup object of the publications page
        '''
        table = bs.find('table', {'class': {'table table-striped bmtableview'}})
        if table is None:
            print("Table not found.")
            return None

        # find() returns None when the table has no <tbody>, so check before reading rows
        tbody = table.find('tbody')
        if tbody is None:
            print("No rows found in the table.")
            return None
        tr_list = tbody.findChildren('tr')

        corpus = []

        for tr in tr_list:
            date_view = tr.find('td', {'class': {'bmdateview'}})
            text_view = tr.find('td', {'class': {'bmtextview'}})
            url = tr.find('a')
            label = None

            if date_view is not None:
                date_view = date_view.get_text().strip()
            if text_view is not None:
                text_view = text_view.get_text().split('Texto completo')[0].strip()
                if 'incrementa' in text_view.lower() or 'aumenta' in text_view.lower():
                    label = 'incrementa'
                elif 'mantiene' in text_view.lower():
                    label = 'mantiene'
                elif 'disminuye' in text_view.lower() or 'reduce' in text_view.lower():
                    label = 'disminuye'
            # Resolve the PDF link; rows without a usable link get url = None
            if url is not None and 'href' in url.attrs:
                url = urllib.parse.urljoin(self.url_base, url.get('href'))
            else:
                url = None
            
            # Download the PDF text only when a link was found
            text = self.read_PDF(url) if url is not None else None
            publication = Publication(date_view, text_view, label, url, text, None, None)
            #publication.__str__()
            corpus.append(publication)
    
        return corpus
    
    def read_PDF(self, file_url):
        '''
        To read text from a pdf file
        
        Parameters
        ----------
        file_url: file url
        '''
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        
        # Extract text
        file = urllib.request.urlopen(file_url).read()
        fp = BytesIO(file)
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
        text = retstr.getvalue()
        
        # Cleanup
        fp.close()
        device.close()
        retstr.close()
        
        if text != None:
            text = text.strip()

        return text
    
    def __str__(self):
        '''
        To string method
        '''
        # Return the description instead of printing it, as __str__ is expected to do
        return '\n'.join([
            'session: {0}'.format(self.session),
            'url_base: {0}'.format(self.url_base),
            'headers: {0}'.format(self.headers),
        ])

Process to get information from a web page. It retrieves the list of Banxico publications from https://www.banxico.org.mx/publicaciones-y-prensa/anuncios-de-las-decisiones-de-politica-monetaria/anuncios-politica-monetaria-t.html
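
The crawler derives each label from the one-line summary of the announcement using a keyword rule. Isolated from the scraping code, that rule can be sketched as follows; `label_from_summary` is a hypothetical helper name, while the keywords and their order are the ones used in `get_publication_list`.

```python
def label_from_summary(summary):
    """Map an announcement summary to a label, or None if no keyword matches."""
    text = summary.lower()
    # The checks run in this order, so a summary containing several keywords
    # takes the first label that matches.
    if 'incrementa' in text or 'aumenta' in text:
        return 'incrementa'
    if 'mantiene' in text:
        return 'mantiene'
    if 'disminuye' in text or 'reduce' in text:
        return 'disminuye'
    return None

example = ('El objetivo para la Tasa de Interés Interbancaria a 1 día '
           '(tasa objetivo) disminuye en 25 puntos base')
print(label_from_summary(example))  # disminuye
```

Summaries that contain none of the keywords are left unlabeled (`None`), so it is worth checking the resulting label set before training.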

In [6]:
%%time

crawler = Crawler()
path = ('/publicaciones-y-prensa/'
        'anuncios-de-las-decisiones-de-politica-monetaria/'
        'anuncios-politica-monetaria-t.html')

bs = crawler.get_page(path)
publications = None

if bs is not None:
    #print(bs.prettify())
    publications = crawler.get_publication_list(bs)
    print('Number of publications: {0}\n'.format(len(publications)))
    print('Example, publications[0]\n')
    print(publications[0])
else:
    print('** Could not get the publication list.\n')

Number of publications: 183

Example, publications[0]

date_view: 19/12/19
text_view: El objetivo para la Tasa de Interés Interbancaria a 1 día (tasa objetivo) disminuye en 25 puntos base
label: disminuye
url: https://www.banxico.org.mx/publicaciones-y-prensa/anuncios-de-las-decisiones-de-politica-monetaria/{D5328504-B958-4BE0-44A3-9363FCAC99D3}.pdf
text: Anuncio de Política Mone...
normalized_text: None
token_list: None

CPU times: user 42.8 s, sys: 775 ms, total: 43.6 s
Wall time: 1min 6s


## Processing and Understanding Text 

### Text Tokenization

- Sentence Tokenization (checked)
- Word Tokenization (checked)
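
As a rough sketch of the two tokenization levels, the snippet below splits a text into sentences and a sentence into word tokens with regular expressions. This is a simplification for illustration only: the notebook itself uses `nltk.word_tokenize`, which handles many more cases, and the helper names and sample text here are made up.

```python
import re

def sentence_tokenize(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def word_tokenize(sentence):
    """Return runs of word characters, and single punctuation marks, as tokens."""
    return re.findall(r'\w+|[^\w\s]', sentence)

text = 'La Junta de Gobierno decidió mantener la tasa. La inflación disminuyó.'
sentences = sentence_tokenize(text)
tokens = word_tokenize(sentences[0])
print(sentences)
print(tokens)
```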

### Text Normalization

- Cleaning Text (checked)
- Tokenizing Text (checked)
- Removing Special Characters (checked)
- Expanding Contractions (unnecessary)
- Case Conversions (checked)
- Removing Stopwords (checked)
- Stemming
- Lemmatization (checked)
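
The checked steps above can be sketched as one small pipeline: case conversion, removal of special characters, tokenization, and stopword removal. The stopword set below is a tiny illustrative subset (the notebook uses NLTK's full Spanish list), and `normalize_text` is a hypothetical helper, not the `Normalize` class defined in the next cell.

```python
import re
import string

# Tiny illustrative stopword set; an assumption made for this example only.
STOPWORDS = {'de', 'la', 'el', 'en', 'a', 'ha', 'del', 'para'}

def normalize_text(text):
    """Lowercase, strip ASCII punctuation, split on whitespace and drop stopwords."""
    text = text.lower()
    text = re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', text)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return ' '.join(tokens)

normalized = normalize_text('La Junta de Gobierno ha decidido disminuir en 25 puntos base.')
print(normalized)  # junta gobierno decidido disminuir 25 puntos base
```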

In [7]:
class Normalize:
    '''
    Normalize class contains the main processes to normalize a text from a document.
    
    Attributes
    ----------
    stopword_list: Spanish stopword list from the nltk module
    wnl: WordNetLemmatizer instance

    Methods
    -------
    __init__()
    tokenize_text(text)
    normalize_corpus(text)
    remove_special_characters_before(text)
    remove_special_characters_after(text)
    remove_stopwords(text)
    remove_repeated_chars(text)
    '''
    
    def __init__(self):
        '''
        Constructor method
        '''
        self.stopword_list = nltk.corpus.stopwords.words('spanish')
        self.wnl = WordNetLemmatizer()
        
    def tokenize_text(self, text):
        '''
        Method to tokenize a text from a document.
        
        Parameters
        ----------
        text: text from a document
        '''
        tokens = nltk.word_tokenize(text) 
        tokens = [token.strip() for token in tokens]
        return tokens
    
    def normalize_corpus(self, text):
        '''
        Method to normalize a text from a document.
        
        Parameters
        ----------
        text: text from a document
        '''
        text = text.lower()
        text = self.remove_repeated_chars(text)
        text = self.remove_special_characters_before(text)
        text = self.remove_special_characters_after(text)
        text = self.remove_stopwords(text)
        return text
    
    def remove_special_characters_before(self, text):
        '''
        Method to remove general characters into a text from a document.
        
        Parameters
        ----------
        text: text from a document
        '''
        #PATTERN = r'[?|,|;|:|.|•|$|%|&|*|@|(|)|~]'
        PATTERN = r'[•|~|ª|!|"|·|$|%|&|/|(|)|=|?|¿|*|^|Ç|¨|_|:|;|º|\'|¡|`|+|ç|´|-|.|,|\\|\||@|#|¢|∞|¬|÷|“|≠|´|‚||}|{|–|…|„|}]'
        #PATTERN = r'[^a-zA-Z0-9 ]'
        tokens = self.tokenize_text(text)
        filtered_tokens = filter(None, [re.sub(PATTERN, r'', token) for token in tokens])
        filtered_text = ' '.join(filtered_tokens)
        return filtered_text

    def remove_special_characters_after(self, text):
        '''
        Method to remove special characters into a text from a document.
        
        Parameters
        ----------
        text: text from a document
        '''
        tokens = self.tokenize_text(text)
        pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
        filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
        filtered_text = ' '.join(filtered_tokens)
        return filtered_text
    
    def remove_stopwords(self, text):
        '''
        Method to remove stopwords into a text from a document.
        
        Parameters
        ----------
        text: text from a document
        '''
        tokens = self.tokenize_text(text)
        filtered_tokens = [token for token in tokens if token not in self.stopword_list]
        filtered_text = ' '.join(filtered_tokens)
        return filtered_text
    
    def remove_repeated_chars(self, text):
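        '''
        Method to remove repeated characters from the text of a document.
        
        Note: wn.synsets() looks words up in the English WordNet by default, so 
        Spanish words and numbers with legitimately doubled characters are 
        collapsed as well (for example 'millones' -> 'milones', '2004' -> '204').
        
        Parameters
        ----------
        text: text from a document
        '''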
        tokens = self.tokenize_text(text)
        repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
        match_substitution = r'\1\2\3'

        def replace(old_word):
            if wn.synsets(old_word):
                return old_word
            new_word = repeat_pattern.sub(match_substitution, old_word)
            return replace(new_word) if new_word != old_word else new_word

        correct_tokens = [replace(word) for word in tokens]
        correct_tokens = ' '.join(correct_tokens)
        
        return correct_tokens

Process to build the list of documents and the list of labels.

In [8]:
%%time

mean_word = []
corpus = []
labels = []

# Build the normalizer once; it only holds the stopword list and the lemmatizer
normalize = Normalize()

for publication in publications:
    if publication.text is not None:
        publication.normalized_text = normalize.normalize_corpus(publication.text)
        publication.token_list = normalize.tokenize_text(publication.normalized_text)
        corpus.append(publication.normalized_text)
        labels.append(publication.label)
        mean_word.append(len(publication.token_list))
    else:
        print('Publication without text content')
    
print('Average number of words per document: {}\n'.format(np.mean(mean_word)))
print('Example, publications[0].token_list[:24]\n\n{}\n'.format(publications[0].token_list[:24]))
print('Labels:\n\n{}\n'.format(set(labels)))



Average number of words per document: 548.6174863387978

Example, publications[0].token_list[:24]

['anuncio', 'política', 'monetaria', 'comunicado', 'prensa', '19', 'diciembre', '2019', 'junta', 'gobierno', 'banco', 'méxico', 'decidido', 'disminuir', '25', 'puntos', 'base', 'objetivo', 'tasa', 'interés', 'interbancaria', 'día', 'nivel', '725']

Labels:

{'disminuye', 'incrementa', 'mantiene'}

CPU times: user 7.22 s, sys: 131 ms, total: 7.35 s
Wall time: 7.93 s


## Feature Extraction (Feature Engineering)

### Feature-extraction techniques:
- Bag of Words model
- TF-IDF model
- Advanced word vectorization models

In [9]:
class FeatureExtraction:
    
    def bow_extractor(self, corpus, ngram_range=(1,1)):
        '''
        Bag of Words model converts text documents into vectors such that each document is 
        converted into a vector that represents the frequency of all the distinct words that 
        are present in the document vector space for that specific document.
        
        CountVectorizer class converts a collection of text documents to a matrix of token 
        counts. If you do not provide an a-priori dictionary and you do not use an analyzer 
        that does some kind of feature selection then the number of features will be equal to 
        the vocabulary size found by analyzing the data.
        
        Important notes
        ------------------
        min_df=1 => keeps every term that appears in at least one document of the corpus.
        ngram_range => (min_n, max_n) range of n-gram sizes to extract; for example, (1, 3) 
        produces all unigrams, bigrams, and trigrams.
        fit_transform(self, raw_documents[, y]) => Learn the vocabulary dictionary and return 
        document-term matrix.
        
        More information 
        ------------------
        CountVectorizer class => https://scikit-learn.org/stable/search.html?q=CountVectorizer
        
        Parameters
        ------------------
        corpus: text to convert
        ngram_range: to take into account n-grams as features
        '''
        vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
        features = vectorizer.fit_transform(corpus)
        return vectorizer, features
    
    def tfidf_extractor(self, corpus, ngram_range=(1,1)):
        '''
        Term Frequency-Inverse Document Frequency (TF-IDF) is a combination of two metrics: 
        term frequency and inverse document frequency. Term frequency denoted by tf is what 
        we had computed in the Bag of Words model. It is denoted by the raw frequency value 
        of that term in a particular document. Inverse document frequency denoted by idf is 
        the inverse of the document frequency for each term. It is computed by dividing the 
        total number of documents in our corpus by the document frequency for each term and 
        then applying logarithmic scaling on the result.
        
        Important notes
        ------------------
        norm='l2' => scales each document vector to unit Euclidean length.
        smooth_idf=True => smoothens the idfs by adding one to the document frequencies, as 
        if one extra document contained every term, which prevents zero divisions.
        ngram_range => (min_n, max_n) range of n-gram sizes to extract; for example, (1, 3) 
        produces all unigrams, bigrams, and trigrams.
        fit_transform(self, raw_documents[, y]) => Learn vocabulary and idf, return 
        document-term matrix.
        
        More information 
        ------------------
        TfidfVectorizer class => https://scikit-learn.org/stable/search.html?q=TfidfVectorizer
        
        Parameters
        ------------------
        corpus: text to convert
        ngram_range: to take into account n-grams as features
        '''
        vectorizer = TfidfVectorizer(min_df=1, norm='l2', smooth_idf=True, use_idf=True, ngram_range=ngram_range)
        features = vectorizer.fit_transform(corpus)
        return vectorizer, features
    
    def average_word_vectors(self, words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        # Sum the vectors of the words known to the model
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        
        # Divide once, after the loop, to obtain the average
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)
        
        return feature_vector
    
    def averaged_word_vectorizer(self, corpus, model, num_features):
        vocabulary = set(model.wv.index2word)
        features = [self.average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
        return np.array(features)
    
    def tfidf_wtd_avg_word_vectors(self, words, tfidf_vector, tfidf_vocabulary, model, num_features):
        # Look up each word's TF-IDF weight; words outside the TF-IDF vocabulary get weight 0.
        # Membership is tested explicitly because column index 0 is a valid (but falsy) index.
        word_tfidfs = [tfidf_vector[0, tfidf_vocabulary[word]] 
                       if word in tfidf_vocabulary else 0 for word in words]    
        word_tfidf_map = {word:tfidf_val for word, tfidf_val in zip(words, word_tfidfs)}
        feature_vector = np.zeros((num_features,),dtype="float64")
        vocabulary = set(model.wv.index2word)
        wts = 0.
    
        for word in words:
            if word in vocabulary: 
                word_vector = model.wv[word]
                weighted_word_vector = word_tfidf_map[word] * word_vector
                wts = wts + word_tfidf_map[word]
                feature_vector = np.add(feature_vector, weighted_word_vector)
                
        # Normalize by the total weight of the words that contributed
        if wts:
            feature_vector = np.divide(feature_vector, wts)

        return feature_vector
    
    def tfidf_weighted_averaged_word_vectorizer(self, corpus, tfidf_vectors, tfidf_vocabulary, model, num_features):
        docs_tfidfs = [(doc, doc_tfidf) for doc, doc_tfidf in zip(corpus, tfidf_vectors)]
        features = [self.tfidf_wtd_avg_word_vectors(tokenized_sentence, tfidf, tfidf_vocabulary, model, num_features)
                    for tokenized_sentence, tfidf in docs_tfidfs]
        return np.array(features)
    
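    def prepare_datasets(self, corpus, labels, test_data_proportion=0.3):
        # NOTE: test_data_proportion is not used; the split is fixed at test_size=0.33,
        # which yields the 61 test documents counted in the confusion matrices.
        train_X, test_X, train_Y, test_Y = train_test_split(
            corpus, labels, test_size=0.33, random_state=42)
        return train_X, test_X, train_Y, test_Y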
    
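    def show_features(self, features, feature_names):
        # The vectorizers return sparse matrices; convert them to dense arrays for display
        if sp.issparse(features):
            features = features.toarray()
        return pd.DataFrame(data=features, columns=feature_names)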

In [10]:
featureExt = FeatureExtraction()

train_corpus, test_corpus, train_labels, test_labels = featureExt.prepare_datasets(
    corpus, labels, test_data_proportion=0.3)

### Bag Of Words Model, samples
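
To show what the Bag of Words matrix contains, here is a minimal hand-rolled sketch using only the standard library. The two toy documents are made up; `CountVectorizer` in the cell below does the same counting at scale and returns a sparse matrix instead of nested lists.

```python
from collections import Counter

toy_docs = ['tasa objetivo disminuye', 'tasa objetivo mantiene tasa']

# Vocabulary: every distinct token, sorted so the column order is deterministic.
toy_vocabulary = sorted({token for doc in toy_docs for token in doc.split()})

# One row per document, one column per vocabulary term, each cell a raw count.
bow_matrix = []
for doc in toy_docs:
    counts = Counter(doc.split())
    bow_matrix.append([counts[term] for term in toy_vocabulary])

print(toy_vocabulary)  # ['disminuye', 'mantiene', 'objetivo', 'tasa']
print(bow_matrix)      # [[1, 0, 1, 1], [0, 1, 1, 2]]
```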

In [11]:
bow_vectorizer, bow_train_features = featureExt.bow_extractor(train_corpus)
bow_test_features = bow_vectorizer.transform(test_corpus)
feature_names = bow_vectorizer.get_feature_names()
#print(featureExt.show_features(bow_train_features, feature_names).head())
#print(featureExt.show_features(bow_test_features, feature_names).head())

### Term Frequency-Inverse Document Frequency (TF-IDF) Model, samples

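$$ idf(t) = 1 + \log \frac{1 + C}{1 + df(t)} $$

where $C$ is the number of documents in the corpus and $df(t)$ is the number of documents that contain term $t$. This is the smoothed form applied by `TfidfVectorizer` when `smooth_idf=True`, as in `tfidf_extractor`.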

$$ tfidf = \frac{tfidf}{\left|\left| tfidf\ \right|\right|} $$
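
A small worked example may help connect the two formulas. The sketch below computes the idf with the smoothed form that scikit-learn documents for `smooth_idf=True`, namely $1 + \ln\frac{1 + C}{1 + df(t)}$, multiplies it by the raw term counts, and then divides each document vector by its Euclidean norm. The toy documents and variable names are assumptions made for the illustration.

```python
import math

toy_docs = [['tasa', 'objetivo', 'disminuye'],
            ['tasa', 'objetivo', 'mantiene', 'tasa']]
toy_vocabulary = sorted({term for doc in toy_docs for term in doc})
n_docs = len(toy_docs)

# Document frequency: number of documents that contain each term.
df = {term: sum(term in doc for doc in toy_docs) for term in toy_vocabulary}

# Smoothed inverse document frequency.
idf = {term: 1.0 + math.log((1 + n_docs) / (1 + df[term])) for term in toy_vocabulary}

def tfidf_vector(doc):
    """Raw term frequency times idf, scaled to unit L2 norm."""
    raw = [doc.count(term) * idf[term] for term in toy_vocabulary]
    length = math.sqrt(sum(value * value for value in raw))
    return [value / length for value in raw] if length else raw

vectors = [tfidf_vector(doc) for doc in toy_docs]
for term, weights in zip(toy_vocabulary, zip(*vectors)):
    print(term, [round(weight, 3) for weight in weights])
```

Terms that occur in every document (here `tasa` and `objetivo`) get the minimum idf of 1, while terms unique to one document are weighted up.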

In [12]:
tfidf_vectorizer, tfidf_train_features = featureExt.tfidf_extractor(train_corpus)
tfidf_test_features = tfidf_vectorizer.transform(test_corpus)
feature_names = tfidf_vectorizer.get_feature_names()
#print(featureExt.show_features(tfidf_train_features, feature_names).head())
#print(featureExt.show_features(tfidf_test_features, feature_names).head())

### Advanced Word Vectorization Models

- Averaged word vectors
- TF-IDF weighted word vectors

In [13]:
# tokenize corpora (documents)
tokenized_train = [nltk.word_tokenize(text) for text in train_corpus]
tokenized_test = [nltk.word_tokenize(text) for text in test_corpus]

# build the word2vec model on our training corpus
model = gensim.models.Word2Vec(tokenized_train, size=500, window=100, min_count=30, sample=1e-3)

### Averaged Word Vectors

$$ AWV(D) = \frac{\sum_{i=1}^{n} wv(w_i)}{n} $$
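
The formula can be checked by hand on a toy case. In the sketch below the 2-dimensional embeddings are made-up values (real Word2Vec vectors are learned), and $n$ counts only the words that have a vector, which is how `average_word_vectors` treats out-of-vocabulary words.

```python
# Hypothetical 2-dimensional embeddings used only for this illustration.
TOY_EMBEDDINGS = {
    'tasa':      [1.0, 0.0],
    'objetivo':  [0.0, 1.0],
    'disminuye': [2.0, 2.0],
}

def average_word_vector(tokens, embeddings, num_features):
    """Average the vectors of the tokens found in the embedding vocabulary."""
    total = [0.0] * num_features
    n_words = 0
    for token in tokens:
        if token in embeddings:
            n_words += 1
            total = [t + v for t, v in zip(total, embeddings[token])]
    # Divide once, after the loop; return zeros if no token was known.
    return [t / n_words for t in total] if n_words else total

doc_vector = average_word_vector(
    ['tasa', 'objetivo', 'disminuye', 'desconocida'], TOY_EMBEDDINGS, 2)
print(doc_vector)  # [1.0, 1.0]
```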

In [14]:
avg_wv_train_features = featureExt.averaged_word_vectorizer(
    corpus=tokenized_train, model=model, num_features=500)
avg_wv_test_features = featureExt.averaged_word_vectorizer(
    corpus=tokenized_test, model=model, num_features=500)



### TF-IDF Weighted Averaged Word Vectors

$$ TWA(D) = \frac{\sum_{i=1}^{n} wv(w_i) \times tfidf(w_i)}{\sum_{i=1}^{n} tfidf(w_i)} $$
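
A toy calculation illustrates the weighting. The sketch below multiplies each word vector by its TF-IDF weight and divides by the sum of the weights, which is the normalization that `tfidf_wtd_avg_word_vectors` applies. The embeddings and weights are made-up values, and the helper name is hypothetical.

```python
def weighted_average_word_vector(tokens, embeddings, weights, num_features):
    """TF-IDF weighted average of the vectors of the known tokens."""
    total = [0.0] * num_features
    weight_sum = 0.0
    for token in tokens:
        if token in embeddings:
            weight = weights.get(token, 0.0)  # tokens without a weight contribute nothing
            total = [t + weight * v for t, v in zip(total, embeddings[token])]
            weight_sum += weight
    # Normalize by the total weight; return zeros if nothing contributed.
    return [t / weight_sum for t in total] if weight_sum else total

# Hypothetical embeddings and TF-IDF weights for a two-word document.
TOY_EMBEDDINGS = {'tasa': [1.0, 0.0], 'disminuye': [0.0, 1.0]}
TOY_TFIDF = {'tasa': 0.25, 'disminuye': 0.75}

doc_vector = weighted_average_word_vector(['tasa', 'disminuye'], TOY_EMBEDDINGS, TOY_TFIDF, 2)
print(doc_vector)  # [0.25, 0.75]
```

Compared with the plain average, the resulting vector is pulled toward the words with higher TF-IDF weight, here `disminuye`.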

In [15]:
vocab = tfidf_vectorizer.vocabulary_

tfidf_wv_train_features = featureExt.tfidf_weighted_averaged_word_vectorizer(
    corpus=tokenized_train, tfidf_vectors=tfidf_train_features, tfidf_vocabulary=vocab, model=model, num_features=500)
tfidf_wv_test_features = featureExt.tfidf_weighted_averaged_word_vectorizer(
    corpus=tokenized_test, tfidf_vectors=tfidf_test_features, tfidf_vocabulary=vocab, model=model, num_features=500)



## Classification Algorithms

There are various types of classification algorithms; this notebook uses two:
- Multinomial Naïve Bayes
- Support vector machines
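
To give an intuition for the first algorithm, here is a compact from-scratch sketch of Multinomial Naïve Bayes with Laplace smoothing. It is a teaching aid under simplifying assumptions (token lists as input, unseen tokens skipped at prediction time), not a substitute for scikit-learn's `MultinomialNB` used below; the function names and toy documents are made up.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    vocabulary = {token for doc in docs for token in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {token: math.log((token_counts[c][token] + alpha) / total)
                             for token in vocabulary}
    return log_prior, log_likelihood

def predict_multinomial_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        known = [token for token in doc if token in log_likelihood[c]]
        scores[c] = log_prior[c] + sum(log_likelihood[c][token] for token in known)
    return max(scores, key=scores.get)

toy_docs = [['decidido', 'disminuir', 'tasa'],
            ['decidido', 'aumentar', 'tasa'],
            ['decidido', 'mantener', 'tasa']]
toy_labels = ['disminuye', 'incrementa', 'mantiene']

log_prior, log_likelihood = train_multinomial_nb(toy_docs, toy_labels)
prediction = predict_multinomial_nb(['disminuir', 'tasa'], log_prior, log_likelihood)
print(prediction)  # disminuye
```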

In [16]:
class Classification:

    def prepare_datasets(self, corpus, labels, test_data_proportion=0.3):
        train_X, test_X, train_Y, test_Y = train_test_split(corpus, labels, test_size=0.33, random_state=42)
        return train_X, test_X, train_Y, test_Y

    def remove_empty_docs(self, corpus, labels):
        filtered_corpus = []
        filtered_labels = []
        for doc, label in zip(corpus, labels):
            if doc.strip():
                filtered_corpus.append(doc)
                filtered_labels.append(label)
        return filtered_corpus, filtered_labels
    
    def train_predict_evaluate_model(self, model_name, classifier, train_features, train_labels, test_features, test_labels):
        # build model    
        classifier.fit(train_features, train_labels)
        # predict using model
        predictions = classifier.predict(test_features) 
        # evaluate model prediction performance   
        self.get_metrics(model_name, true_labels=test_labels, predicted_labels=predictions)
        return predictions 
    
    def get_metrics(self, model_name, true_labels, predicted_labels):
        print('******** {} ********'.format(model_name))
        print('Accuracy:', np.round(metrics.accuracy_score(true_labels, predicted_labels), 2))
        print('Precision:', np.round(metrics.precision_score(true_labels, predicted_labels, average='weighted'), 2))
        print('Recall:', np.round(metrics.recall_score(true_labels, predicted_labels, average='weighted'), 2))
        print('F1 Score:', np.round(metrics.f1_score(true_labels, predicted_labels, average='weighted'), 2))
        print('Confusion matrix:\n', self.get_confusion_matrix(true_labels, predicted_labels))
        print('')
        
    def get_confusion_matrix(self, test_labels, predictions):
        '''
        Method to get the confusion matrix of the results.
        
        Parameters
        ------------------
        test_labels: labels from corpus
        predictions: predicted labels
        '''
        class_names = ['disminuye','incrementa','mantiene']
        cm = metrics.confusion_matrix(test_labels, predictions, labels=class_names)
        # pandas >= 0.24 calls this MultiIndex argument 'codes' (it was formerly 'labels')
        return pd.DataFrame(
            data=cm, 
            columns=pd.MultiIndex(
                levels=[['Predicted:'], class_names],
                codes=[[0,0,0],[0,1,2]]),
            index=pd.MultiIndex(
                levels=[['Actual:'], class_names],
                codes=[[0,0,0],[0,1,2]]))
    
    def get_fake_documents(self, test_corpus, test_labels, predictions):
        '''
        Method to print information about the publications which were incorrectly predicted.
        
        Parameters
        ------------------
        test_corpus: text from corpus
        test_labels: labels from corpus
        predictions: predicted labels
        '''
        for document, label, predicted_label in zip(test_corpus, test_labels, predictions):
            if label != predicted_label:
                print('Actual label:', label)
                print('Predicted label:', predicted_label)
                print('Document text:')
                print(re.sub('\n', ' ', document[:200]))
                print('')

In [17]:
classification = Classification()
mnb = MultinomialNB()
svm = SGDClassifier(loss='hinge')

## Bag of words features

# Multinomial Naive Bayes with bag of words features
mnb_bow_predictions = classification.train_predict_evaluate_model(
    model_name='Multinomial Naive Bayes with bag of words features',
    classifier=mnb, train_features=bow_train_features, train_labels=train_labels, 
    test_features=bow_test_features, test_labels=test_labels)
# Support Vector Machine with bag of words features
svm_bow_predictions = classification.train_predict_evaluate_model(
    model_name='Support Vector Machine with bag of words features',
    classifier=svm, train_features=bow_train_features, train_labels=train_labels, 
    test_features=bow_test_features, test_labels=test_labels)


## Term Frequency-Inverse Document Frequency (TF-IDF) features

# Multinomial Naive Bayes with TF-IDF features
mnb_tfidf_predictions = classification.train_predict_evaluate_model(
    model_name='Multinomial Naive Bayes with TF-IDF features',
    classifier=mnb, train_features=tfidf_train_features, train_labels=train_labels,
    test_features=tfidf_test_features, test_labels=test_labels)
# Support Vector Machine with TF-IDF features
svm_tfidf_predictions = classification.train_predict_evaluate_model(
    model_name='Support Vector Machine with TF-IDF features',
    classifier=svm, train_features=tfidf_train_features, train_labels=train_labels,
    test_features=tfidf_test_features, test_labels=test_labels)


## Averaged Word Vector features

# Support Vector Machine with averaged word vector features
svm_avgwv_predictions = classification.train_predict_evaluate_model(
    model_name='Support Vector Machine with averaged word vector features',
    classifier=svm, train_features=avg_wv_train_features, train_labels=train_labels,
    test_features=avg_wv_test_features, test_labels=test_labels)
# Support Vector Machine with tfidf weighted averaged word vector features
svm_tfidfwv_predictions = classification.train_predict_evaluate_model(
    model_name='Support Vector Machine with tfidf weighted averaged word vector features',
    classifier=svm, train_features=tfidf_wv_train_features, train_labels=train_labels,
    test_features=tfidf_wv_test_features, test_labels=test_labels)

******** Multinomial Naive Bayes with bag of words features ********
Accuracy: 0.74
Precision: 0.77
Recall: 0.74
F1 Score: 0.69
Confusion matrix:
                    Predicted:                    
                    disminuye incrementa mantiene
Actual: disminuye           2          0        4
        incrementa          0          4       11
        mantiene            0          1       39

******** Support Vector Machine with bag of words features ********
Accuracy: 0.8
Precision: 0.82
Recall: 0.8
F1 Score: 0.78
Confusion matrix:
                    Predicted:                    
                    disminuye incrementa mantiene
Actual: disminuye           2          1        3
        incrementa          0          8        7
        mantiene            0          1       39

******** Multinomial Naive Bayes with TF-IDF features ********
Accuracy: 0.66
Precision: 0.43
Recall: 0.66
F1 Score: 0.52
Confusion matrix:
                    Predicted:                    
                



In [18]:
classification.get_fake_documents(test_corpus, test_labels, mnb_bow_predictions)

Actual label: incrementa
Predicted label: mantiene
Document text:
comunicado prensa 23 julio 204 anuncio política monetaria junta gobierno banco méxico decidido aumentar corto ” 41 milones pesos partir hoy mantiene expansión económica mundial si bien ritmo ligeramen

Actual label: incrementa
Predicted label: mantiene
Document text:
anuncio política monetaria comunicado prensa 8 febrero 2018 junta gobierno banco méxico decidido aumentar 25 puntos base objetivo tasa interés interbancaria día nivel 750 ciento economía mundial sigui

Actual label: incrementa
Predicted label: mantiene
Document text:
15 agosto 208 comunicado prensa anuncio política monetaria junta gobierno banco méxico decidido incrementar 825 ciento objetivo tasa interés interbancaria 1 día desaceleración económica mundial intens

Actual label: disminuye
Predicted label: mantiene
Document text:
6 junio 2014 anuncio política monetaria comunicado prensa junta gobierno banco méxico decidido disminuir 50 puntos base objetivo ta

## Evaluating Classification Models

Metrics:

- Accuracy
- Precision
- Recall
- F1 score

A *confusion matrix* is a tabular structure that helps visualize the performance of classifiers.
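
All four metrics can be derived from the confusion matrix itself. As a sketch, the code below recomputes them from the matrix printed earlier for the Support Vector Machine with bag of words features, using support-weighted averages as `get_metrics` does with `average='weighted'`. The `weighted_metrics` helper is a hypothetical name introduced for this example.

```python
# Class order: disminuye, incrementa, mantiene.
# Rows are actual classes, columns are predicted classes
# (Support Vector Machine with bag of words features).
CONFUSION = [[2, 1, 3],
             [0, 8, 7],
             [0, 1, 39]]

def weighted_metrics(cm):
    """Return accuracy and support-weighted precision, recall and F1."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    precision = recall = f1 = 0.0
    for i in range(n):
        support = sum(cm[i])                          # actual members of class i
        predicted = sum(cm[r][i] for r in range(n))   # items predicted as class i
        p = cm[i][i] / predicted if predicted else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1

accuracy, precision, recall, f1 = weighted_metrics(CONFUSION)
print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.8 0.82 0.8 0.78
```

The rounded values match the Accuracy, Precision, Recall and F1 Score lines reported for that model. With weighted averaging, recall always equals accuracy, because both reduce to the fraction of correctly classified documents.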
