<a href="https://colab.research.google.com/github/testUNECE/template/blob/master/3_ML_modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 3 Machine Learning Modeling

## 3.1 Split to training set and test set
---
In machine learning, a common task is to study and construct algorithms that can learn from data and make predictions on it.

The model is initially fit on a training dataset, that is, a set of samples used to fit the parameters of the model.

A model that fits the training data best may not fit unseen real-world data. To assess how well the model generalizes and to detect overfitting, a separate test dataset is needed.

The test dataset is independent of the training dataset but follows the same probability distribution.

An 80/20 split, loosely associated with the Pareto principle, is a common rule of thumb for dividing a dataset into training and test data.
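
As a rough illustration of the 80/20 rule of thumb, the sketch below shuffles a toy list and holds out 20% of it as test data. The helper `split_train_test` and the toy samples are hypothetical; the working example later in this notebook uses scikit-learn's `train_test_split` instead.

```python
import random


def split_train_test(samples, test_size=0.2, seed=0):
    """Shuffle the samples and split them into training and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: ten items standing in for product descriptions
samples = [f"item_{i}" for i in range(10)]
train, test = split_train_test(samples, test_size=0.2)
print(len(train), len(test))  # 8 2
```

Shuffling before splitting is what makes the held-out portion a random sample of the same data, so that it can plausibly follow the same distribution as the training portion.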

## 3.2 Implement Machine Learning Models

---



Six machine learning models are trained on the data: Multinomial Naive Bayes, Support Vector Machine (LinearSVC), Logistic Regression, Random Forest, XGBoost, and Deep Neural Network.

We train each model on the training set. Within each training process, we use grid search to find optimal hyperparameters and calculate the model accuracies. We then assess the model performance on the test data.

The search method and the accuracy measures used are explained below.

<b>3.2.1 Tune Hyperparameters Using GridSearchCV</b>

A hyperparameter is a parameter whose value is used to control the learning process. 

The same kind of machine learning model can require different hyperparameters, such as constants, weights, or learning rates, to generalize to different data patterns. Hyperparameter tuning is the process of choosing a set of optimal hyperparameters for a machine learning algorithm.

In our case, grid search is used to tune the hyperparameters of each model. Unlike random search, grid search is an exhaustive method that evaluates every combination of parameter values from predefined subsets. Grid search guarantees that the best-scoring combination within those subsets is selected. Its disadvantage is a relatively high computational cost compared with random search.
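
To make "all combinations" concrete, the following sketch enumerates a small grid with the standard library and counts how many model fits it implies. The grid contents, the helper `iter_grid`, and the fold count are illustrative assumptions; `GridSearchCV` performs this enumeration internally.

```python
from itertools import product

# Hypothetical grid, modeled on the kind of pipeline parameters tuned later
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "tfidf__use_idf": [True, False],
    "clf__alpha": [1, 0.5, 0.1],
}


def iter_grid(grid):
    """Yield every combination of parameter values as a dict."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


candidates = list(iter_grid(param_grid))
n_folds = 5  # assumed number of cross-validation folds
print(len(candidates))            # 18 candidates (3 * 2 * 3)
print(len(candidates) * n_folds)  # 90 model fits in total
```

The number of fits grows multiplicatively with every parameter added to the grid, which is where the computational cost of grid search comes from.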

<b>3.2.2 Machine Learning Models</b>
* <b>1) Multinomial Naive Bayes</b>
Naive Bayes is a family of probabilistic classifiers that calculate the probability of each category using Bayes' theorem with strong independence assumptions between the features. It is the simplest of the text classification models considered here. When dealing with text data, multinomial naive Bayes is adopted.
* <b>2) Linear Support Vector Machine</b>
A support vector machine determines the best decision boundary between vectors that belong to a given group/category and vectors that do not belong to it. In our case, we adopted a linear classifier because the linear kernel is computationally cheap and works well for text classification problems.
* <b>3) Logistic Regression</b>
Logistic regression models the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function. Logistic regression can be seen as a special case of the generalized linear model and is thus analogous to linear regression. In natural language processing, logistic regression is a baseline supervised machine learning algorithm for classification and is closely related to neural networks.
* <b>4) Random Forest</b>
Random forest is an ensemble learning method for classification tasks. It is built from a multitude of decision trees and outputs the class that is the mode of the classes predicted by the individual trees. Random forests correct for decision trees' tendency to overfit by using bootstrap aggregating (bagging).
* <b>5) XGBoost</b>
XGBoost stands for eXtreme Gradient Boosting and is an implementation of gradient boosting machines that pushes the limits of computing power for boosted models. Boosting models are another type of ensemble model, typically built on decision trees. Boosting is an ensemble meta-algorithm that primarily reduces bias, and also variance, in supervised learning; it is a family of machine learning algorithms that convert weak learners into strong ones. The tree-based XGBoost model is used in this case.
* <b>6) Neural Network</b>
Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input. Deep neural network architectures learn through stacked layers, where each hidden layer receives connections only from the previous layer and provides connections only to the next layer. The input layer receives the vector representation of the text. The number of neurons in the output layer equals the number of classes for multi-class classification.
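
The two ideas behind the random forest described above, bootstrap aggregating and taking the mode of the trees' predictions, can be sketched in a few lines. The helpers `bootstrap_sample` and `majority_vote` and the toy predictions are hypothetical and stand in for what `RandomForestClassifier` does internally with real decision trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the most common predicted class (the mode)."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = ["a", "b", "c", "d", "e"]
sample = bootstrap_sample(rows, rng)  # some rows may repeat, others may be missing
print(len(sample) == len(rows))  # True

# Hypothetical class predictions from five trees for one product description
tree_predictions = ["Bread", "Butter", "Bread", "Chips", "Bread"]
print(majority_vote(tree_predictions))  # Bread
```

Because each tree sees a different bootstrap sample, the trees make partly independent errors, and the vote tends to average those errors out.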

## 3.3 Accuracy measure

---

The classification_report shows the main classification metrics on a per-class basis. This gives a deeper intuition of the classifier's behavior than global accuracy, which can mask functional weaknesses in one class of a multiclass problem. The metrics are calculated from true and false positives (TP, FP) and true and false negatives (TN, FN).

<table >
<tbody>
  <tr>
    <td> </td>
    <td> </td>
    <td colspan="2" align="center"> <b> True </b> </td>
  </tr>
  <tr>
    <td> </td>
    <td> </td>
    <td> Positive </td>
    <td> Negative </td>
  </tr>
  <tr>
    <td rowspan = 2> <b> Predicted </b> </td>
    <td> Positive </td>
    <td> True Positive (TP) </td>
    <td> False Positive (FP) </td>
  </tr>
  <tr>
    <td> Negative </td>
    <td> False Negative (FN) </td>
    <td> True Negative (TN) </td>
  </tr>
  </tbody>
  </table>

    



Precision: the ratio of correctly classified positive examples to the total number of examples predicted as positive. High precision indicates that an example labelled as positive is indeed positive.

$ Precision = \frac{TP}{TP+FP} $

Recall: the ratio of correctly classified positive examples to the total number of actual positive examples. High recall indicates that most examples of the class are correctly recognized.

$ Recall = \frac{TP}{TP+FN} $

F1 score: the harmonic mean of precision and recall, such that the best score is 1 and the worst is 0.

$ F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} $

Support: the number of actual occurrences of the class in the evaluated dataset.
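
As a worked example of these definitions, the sketch below computes the metrics for a single class from its confusion counts. The helper `precision_recall_f1` is hypothetical, and the counts are assumed values chosen to be consistent with a class that has support 10, recall 0.6, and precision of about 0.43.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from its counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts: 6 true positives, 8 false positives, 4 false negatives
tp, fp, fn = 6, 8, 4
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
support = tp + fn  # actual occurrences of the class
print(round(precision, 4), round(recall, 4), round(f1, 4), support)
# 0.4286 0.6 0.5 10
```

Note that true negatives do not enter any of these three metrics, which is why they remain informative for rare classes.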

## 3.4 Working Example Using ECOICOP data

---



> Load Required Libraries and Packages 


Scikit-learn (sklearn) is a machine learning library that implements classification, regression, and clustering algorithms, such as support vector machines, random forests, naive Bayes, and gradient boosting.

XGBoost is an open-source software library which provides a gradient boosting framework.

Keras is an open-source neural-network library written in Python. 
      

In [0]:
# Upload files
from google.colab import files 
import io
# Import data
import pandas as pd
# Plot graphs 
import matplotlib.pyplot as plt
# Create training and testing data
from sklearn.model_selection import train_test_split, GridSearchCV
# Natural language processing
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
# Implement machine learning models
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB 
from sklearn import svm
from sklearn import linear_model
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
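from keras.layers import Dropout, Dense
from keras.models import Sequential
# EarlyStopping is used when training the neural network below
from keras.callbacks import EarlyStopping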
# Assess model performance
from sklearn.metrics import classification_report, accuracy_score
# Record the running time of training the model.
import timeit

Using TensorFlow backend.


In [0]:
# Import excel file on github
df = pd.read_excel('https://raw.githubusercontent.com/UNECE/ML_dataset/master/Stats%20Poland%20ECOICOP%20data.xlsx', sheet_name = 'Data')

In [0]:
# Prepare training data and test data
x_train, x_test, y_train, y_test = train_test_split(df['Desc_E'], df['Code_E'], test_size=0.20, random_state=0)

> 1) Multinomial Naive Bayes

In [0]:
# Build pipeline
  # Perform NLP task: convert the data to vectors of numbers 
text_clf_nb = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])
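
To show what "convert the data to vectors of numbers" means, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes lowercased word tokens of two or more characters, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and l2 normalization; the function `tfidf` and the toy descriptions are hypothetical, and the pipeline above relies on `CountVectorizer` and `TfidfTransformer` rather than this code.

```python
import math
import re
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, l2 norm)."""
    token_lists = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in token_lists:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Hypothetical product descriptions
docs = ["white bread", "rye bread", "salted butter"]
vectors = tfidf(docs)
print(vectors[0]["white"] > vectors[0]["bread"])  # True
```

The word "bread" occurs in two of the three descriptions, so it receives a lower weight than "white", which occurs in only one.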

In [0]:
# Define a set of parameters
tuned_parameters_nb = {
    # The number of words in a sequence
      # (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, (2, 2) means only bigrams
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    # A boolean, enable inverse-document-frequency reweighting
    'tfidf__use_idf': (True, False),
    # Each output row will have norm either l1: Sum of absolute values of vector elements is 1
                                       # or l2: Sum of squares of vector elements is 1. 
    'tfidf__norm': ('l1', 'l2'),
    # Additive smoothing parameter
    'clf__alpha': [1, 0.5, 0.1, 0.05, 0.01],
    # A boolean, whether to learn class prior probabilities or not. If false, a uniform prior will be used.
    'clf__fit_prior': (True, False),
}
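
The effect of the `ngram_range` values in the grid above can be illustrated with a small helper. The function `word_ngrams` and the sample description are hypothetical, and the sketch splits on whitespace only, which is simpler than the tokenization `CountVectorizer` applies.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return word n-grams for every n in the inclusive ngram_range."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


description = "whole wheat bread"  # hypothetical product description
print(word_ngrams(description, (1, 1)))  # ['whole', 'wheat', 'bread']
print(word_ngrams(description, (1, 2)))
# ['whole', 'wheat', 'bread', 'whole wheat', 'wheat bread']
print(word_ngrams(description, (2, 2)))  # ['whole wheat', 'wheat bread']
```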

In [0]:
# Tune hyperparameters using grid search with 10-fold cross-validation
  # verbose: Controls the verbosity. The higher, the more messages.
clf = GridSearchCV(text_clf_nb, tuned_parameters_nb, cv=10, verbose = 1)
# Record the running time of training the model.
start = timeit.default_timer()
# Fit training data
clf.fit(x_train, y_train)
stop = timeit.default_timer()
print('Training Time: ', stop - start)  

In [0]:
# Assess the model
print(classification_report(y_test, clf.predict(x_test), digits=4))

                                                                            precision    recall  f1-score   support

                                              Artificial sugar substitutes     0.4286    0.6000    0.5000        10
                                                        Beef and veal meat     1.0000    0.6923    0.8182        13
                                                                     Bread     0.8475    0.9091    0.8772        55
                                                         Breakfast cereals     0.9077    0.9516    0.9291        62
                                                                    Butter     0.8667    0.9286    0.8966        14
                                                           Cheese and curd     0.9802    0.9659    0.9730       205
                                                                     Chips     0.8333    0.9091    0.8696        33
                                                                 Chocol

In [0]:
# Get the set of optimal hyperparameters
print(clf.best_params_)

{'clf__alpha': 0.01, 'clf__fit_prior': False, 'tfidf__norm': 'l1', 'tfidf__use_idf': False, 'vect__ngram_range': (1, 2)}


In [0]:
# accuracy_score calculates the fraction of predicted labels that match the corresponding true labels
accuracy_score(y_test, clf.predict(x_test))

0.9011695906432748
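
The sketch below illustrates why a single accuracy figure is read together with the per-class report. The helpers `accuracy` and `recall_for` and the toy labels are hypothetical; they mirror the definitions of accuracy and recall rather than scikit-learn's implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_for(label, y_true, y_pred):
    """Recall for one class: correct predictions among its actual occurrences."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return hits / sum(t == label for t in y_true)


# Toy data: a classifier that always predicts the majority class
y_true = ["Bread"] * 9 + ["Butter"]
y_pred = ["Bread"] * 10
print(accuracy(y_true, y_pred))              # 0.9
print(recall_for("Butter", y_true, y_pred))  # 0.0
```

An accuracy of 0.9 here hides the fact that the minority class is never recognized.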

> 2) Linear Support Vector Machine 

In [0]:
# Build pipeline
  # Perform NLP task: convert the data to vectors of numbers 
text_clf_svm = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf',  svm.LinearSVC(random_state = 123))])

In [0]:
# Define a set of parameters
tuned_parameters_svm = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),  
    # dual: A boolean, (default=True). Select the algorithm to either solve the dual or primal optimization problem. 
      # Prefer dual=False when n_samples > n_features. 
    'clf__dual': (True, False),
    # Regularization parameter. The strength of the regularization is inversely proportional to C.
    'clf__C': [0.1, 1, 10], 
    # The maximum number of iterations to be run.
    'clf__max_iter': [2500, 5000]
    }

In [0]:
# Tune hyperparameters using grid search with 5-fold cross-validation
clf = GridSearchCV(text_clf_svm, tuned_parameters_svm, cv = 5, verbose=1)
# Record the running time of training the model.
start = timeit.default_timer()
# Fit training data
clf.fit(x_train, y_train)
stop = timeit.default_timer()
print('Training Time: ', stop - start) 

Fitting 5 folds for each of 144 candidates, totalling 720 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 720 out of 720 | elapsed: 31.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [0]:
# Assess the model
print(classification_report(y_test, clf.predict(x_test), digits=4))

                                                                            precision    recall  f1-score   support

                                              Artificial sugar substitutes     0.8750    0.7000    0.7778        10
                                                        Beef and veal meat     0.9286    1.0000    0.9630        13
                                                                     Bread     0.8475    0.9091    0.8772        55
                                                         Breakfast cereals     1.0000    0.9516    0.9752        62
                                                                    Butter     0.8125    0.9286    0.8667        14
                                                           Cheese and curd     0.9665    0.9854    0.9758       205
                                                                     Chips     0.9333    0.8485    0.8889        33
                                                                 Chocol

In [0]:
# Get the set of optimal hyperparameters
print(clf.best_estimator_)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 2), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 LinearSVC(C=1, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                       

In [0]:
accuracy_score(y_test, clf.predict(x_test))

0.9149122807017543

> 3) Logistic Regression

In [0]:
# Build pipeline
  # Perform NLP task: convert the data to vectors of numbers 
text_clf_log = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', linear_model.LogisticRegression())])

In [0]:
# Define a set of parameters
tuned_parameters_log = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__C': [1, 2],
    'clf__max_iter': [500, 1000]
}

In [0]:
# Tune hyperparameters using grid search with 5-fold cross-validation
clf = GridSearchCV(text_clf_log, tuned_parameters_log, cv = 5, verbose = 1)
# Record the running time of training the model.
start = timeit.default_timer()
# Fit training data
clf.fit(x_train, y_train)
stop = timeit.default_timer()
print('Training Time: ', stop - start) 

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed: 127.1min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [0]:
# Assess the model
print(classification_report(y_test, clf.predict(x_test), digits=4))

                                                                            precision    recall  f1-score   support

                                              Artificial sugar substitutes     0.8750    0.7000    0.7778        10
                                                        Beef and veal meat     0.8462    0.8462    0.8462        13
                                                                     Bread     0.8750    0.8909    0.8829        55
                                                         Breakfast cereals     0.9219    0.9516    0.9365        62
                                                                    Butter     0.7778    1.0000    0.8750        14
                                                           Cheese and curd     0.9481    0.9805    0.9640       205
                                                                     Chips     0.9000    0.8182    0.8571        33
                                                                 Chocol

In [0]:
# Get the optimal hyperparameters
print(clf.best_estimator_)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                 RandomForestClassifier(bootstrap=False, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='log2',
                                        max_leaf_nodes=None, max_samples=None,
                               

In [0]:
accuracy_score(y_test, clf.predict(x_test))

0.902046783625731

> 4) Random Forest

In [0]:
# Build pipeline
  # Perform NLP task: convert the data to vectors of numbers 
text_clf_rf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier())])

In [0]:
# Define a set of parameters
tuned_parameters_rf = {
    # 'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'), 
    # The number of trees in the forest.
    'clf__n_estimators': [100, 200],
    # The number of features to consider when looking for the best split:
    'clf__max_features': ('auto','log2'),
    # Whether bootstrap samples are used when building trees. 
      # If False, the whole dataset is used to build each tree.
    'clf__bootstrap': (True, False)
              }

In [0]:
# Tune hyperparameters using grid search with 5-fold cross-validation
clf= GridSearchCV(text_clf_rf, tuned_parameters_rf, cv=5, verbose=1)
# Record the running time of training the model.
start = timeit.default_timer()
# Fit training data
clf.fit(x_train, y_train)
stop = timeit.default_timer()
print('Training Time: ', stop - start) 

Fitting 5 folds for each of 32 candidates, totalling 160 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 160 out of 160 | elapsed: 58.7min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [0]:
# Assess the model
print(classification_report(y_test, clf.predict(x_test), digits=4))

                                                                            precision    recall  f1-score   support

                                              Artificial sugar substitutes     1.0000    0.6000    0.7500        10
                                                        Beef and veal meat     0.7857    0.8462    0.8148        13
                                                                     Bread     0.8571    0.8727    0.8649        55
                                                         Breakfast cereals     0.9365    0.9516    0.9440        62
                                                                    Butter     0.7778    1.0000    0.8750        14
                                                           Cheese and curd     0.9571    0.9805    0.9687       205
                                                                     Chips     0.9032    0.8485    0.8750        33
                                                                 Chocol

In [0]:
# Get the optimal hyperparameters
print(clf.best_estimator_)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                 RandomForestClassifier(bootstrap=False, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='log2',
                                        max_leaf_nodes=None, max_samples=None,
                               

In [0]:
# accuracy_score calculates the fraction of predicted labels that match the corresponding true labels
accuracy_score(y_test, clf.predict(x_test))

> 5) XGBoost

In [0]:
# Build pipeline
  # Perform NLP task: convert the data to vectors of numbers 
text_clf_xg = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', xgb.XGBClassifier(objective = 'multi:softmax', n_class = 61))])

In [0]:
# Define a set of parameters
tuned_parameters_xg = {
    # 'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    # 'tfidf__use_idf': (True, False),
    # 'tfidf__norm': ('l1', 'l2'),
    # Subsample ratio of the training instances.
    'clf__subsample': [0.5, 1],
    # Use single precision to build histograms. 
    'clf__single_precision_histogram': (True, False),
    # Build histogram on GPU deterministically.
    'clf__deterministic_histogram': (True, False)
}

In [0]:
# Tune hyperparameters using grid search with 5-fold cross-validation
clf = GridSearchCV(text_clf_xg, tuned_parameters_xg, cv = 5, verbose = 1)
# Record the running time of training the model.
start = timeit.default_timer()
# Fit training data
clf.fit(x_train, y_train)
stop = timeit.default_timer()
print('Training Time: ', stop - start) 

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  40 out of  40 | elapsed: 44.7min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [0]:
# Assess the model
print(classification_report(y_test, clf.predict(x_test), digits=4))

                                                                            precision    recall  f1-score   support

                                              Artificial sugar substitutes     0.8750    0.7000    0.7778        10
                                                        Beef and veal meat     0.7692    0.7692    0.7692        13
                                                                     Bread     0.8519    0.8364    0.8440        55
                                                         Breakfast cereals     0.9298    0.8548    0.8908        62
                                                                    Butter     0.6364    1.0000    0.7778        14
                                                           Cheese and curd     0.9333    0.9561    0.9446       205
                                                                     Chips     0.9032    0.8485    0.8750        33
                                                                 Chocol

  _warn_prf(average, modifier, msg_start, len(result))


In [0]:
# Get the optimal hyperparameters
print(clf.best_estimator_)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=Non...
                               colsample_bytree=1, deterministic_histogram=True,
                               gamma=0, learning_rate=0.1, max_delta_step=0,
                               max_depth=3, min_child_weight=1, missing=None,
                               n_class=61, n_estimators=100, n_jobs=1,
                             

In [0]:
accuracy_score(y_test, clf.predict(x_test))

0.8309941520467836

> 6) Neural Network

In [0]:
# Prepare training data and test data
x_train, x_test, y_train, y_test = train_test_split(df['Desc_E'], df['Code_E'], test_size=0.20, random_state=0)
# Set max_features = 3000
  # With unigrams only, x_test contains 3006 distinct features, fewer than the 5494 in x_train
vectorizer = TfidfVectorizer(max_features=3000)
x_train_tfidf = vectorizer.fit_transform(x_train).toarray()
x_test_tfidf = vectorizer.transform(x_test).toarray()

In [0]:
# Build a deep neural network with two hidden layers
model = Sequential()
# Number of nodes.
  # Randomly assigned.
nodes = 500
# Input shape.
shape = x_train_tfidf.shape[1]
# Number of classes.
n_classes = 61
# Dropout rate.
dropout = 0.5
model.add(Dense(nodes,input_dim=shape,activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(nodes,input_dim=nodes,activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# Set an early stopping monitor so that training stops when the model no longer improves
  # Here, training stops as soon as validation accuracy fails to improve by at least min_delta = 0.1 over the best value so far.
es = EarlyStopping(monitor='val_accuracy', mode='max', min_delta=0.1)
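
To see how strict a `min_delta` of 0.1 is, the following sketch applies a simplified early-stopping rule to a made-up accuracy history. The function `epochs_run` and the history values are hypothetical; this is an illustration of the idea under the assumption that training stops at the first epoch without sufficient improvement, not the Keras implementation.

```python
def epochs_run(val_accuracies, min_delta=0.1):
    """Return how many epochs run before a simplified early-stopping rule fires."""
    best = float("-inf")
    for epoch, value in enumerate(val_accuracies, start=1):
        if value - min_delta > best:
            best = value  # improvement larger than min_delta: keep training
        else:
            return epoch  # first epoch without sufficient improvement
    return len(val_accuracies)


# Hypothetical validation accuracies over four epochs
history = [0.58, 0.80, 0.86, 0.88]
print(epochs_run(history, min_delta=0.1))    # 3
print(epochs_run(history, min_delta=0.001))  # 4
```

With a threshold of 0.1, the gain from 0.80 to 0.86 does not count as an improvement, so training stops at the third epoch even though accuracy is still rising; a smaller `min_delta` lets training continue.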

In [0]:
# Record the running time of training the model.
start = timeit.default_timer()
# Fit training data
model.fit(x_train_tfidf, y_train, 
          validation_data=(x_test_tfidf, y_test), 
          epochs=10, 
          batch_size=200, 
          verbose=1, 
          callbacks = [es])
stop = timeit.default_timer()
print('Training Time: ', stop - start) 

Train on 13679 samples, validate on 3420 samples
Epoch 1/10
Epoch 2/10
  600/13679 [>.............................] - ETA: 3s - loss: 2.2605 - acc: 0.5817



Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fd6cba0b208>

## Reference

[ref_INEGI] INEGI pilot study report 

Sklearn documentation:
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

* https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html 

* https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

XGBoost Documentation:
* https://xgboost.readthedocs.io/en/latest/parameter.html

Keras Documentation:
* https://keras.io/getting-started/sequential-model-guide/