In this notebook, we build a model for classifying message categories. This model will be a pipeline consisting of some natural language processing steps followed by a random forest classifier.

This ML pipeline is also automated in the script `models/train_classifier.py` (see project README).

In [13]:
# import libraries
%matplotlib inline
import matplotlib
import sys
import pandas as pd
import pickle as pkl
import re
import nltk

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sqlalchemy import create_engine
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (classification_report, precision_recall_fscore_support,
                             fbeta_score, make_scorer)
from sklearn.utils import parallel_backend
from sklearn.multioutput import MultiOutputClassifier
from sklearn.dummy import DummyClassifier

nltk.download(['punkt', 'stopwords', 'wordnet'])

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /Users/home/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/home/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/home/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Get features and response

In [2]:
def load_data(database_filepath):
    """Load feature and response data from SQL database file."""
    # load data from database
    engine = create_engine('sqlite:///' + database_filepath)
    table_name = database_filepath.replace('.db', '').split('/')[-1]
    df = pd.read_sql_table(table_name, engine)
    # message text as features
    X = df['message']
    # message category as response
    Y = df[df.columns.drop(['id', 'message', 'genre'])]
    category_names = Y.columns
    return X, Y, category_names

In [3]:
X, Y, category_names = load_data('../data/DisasterResponse.db')

The inputs/features are messages and the outputs/repsonse are binary classifications for each message category.

In [4]:
X.shape

(26216,)

In [5]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [6]:
Y.shape

(26216, 36)

In [7]:
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Tokenize message features

The following function will be the first step in the NLP pipeline

In [8]:
def tokenize(text):
    """Tokenize messages."""
    # replace urls with placeholder
    url_regex = ('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
    # convert to lower case
    text = text.lower()
    # remove punctuation characters
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # word tokens
    words = word_tokenize(text)
    # remove words with digits and stop words
    words = [w for w in words if w not in stopwords.words('english')]
    # lemmatize
    lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
    return lemmed

## Build pipeline

The pipeline consists of three steps:

1. A count vectorizer which tokenizes and converts the message data into a document-frequency matrix (rows are messages, columns are counts of words in the vocabulary)
2. A tf-idf transformer which converts the doc-freq matrix into a tf-idf matrix
3. A random forest classifier. The `scikit-learn` implementation supports [multilabel classification](https://scikit-learn.org/stable/modules/multiclass.html), that is, binary classification for multiple classes. We'll also weight the classes since they're highly imbalanced.



We will also train a pipeline with a multioutput dummy classifier for baseline. Since the output classes are very unbalanced, our dummy classifier will use the `stratified` strategy, which "generates predictions by respecting the training set's class distribution".

In [14]:
random_state = 27

dummy_pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(DummyClassifier(strategy='stratified'),
                                  n_jobs=-1))
])


rf_pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(class_weight='balanced',
                                   n_jobs=-1, random_state=random_state))
])

## Train

### Train test split

In [15]:
# 80-20 train-test split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=random_state)

In [36]:
# fit dummy pipeline
dummy_pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at 0x114e83430>,
                                 vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultiOutputClassifier(estimator=DummyClassifier(constant=None,
          

In [38]:
# fit multioutput rf pipeline
rf_pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight='balanced',
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None, max

## Evaluate baseline models

We'll look at a standard classification evaluation metrics per category as well as overall. For overall metrics we'll look at both macro averages and micro averages. From [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/pdf/13bayes.pdf): "Microaveraged results are therefore really a measure of effectiveness on
the large classes in a test collection. To get a sense of effectiveness on small
classes, you should compute macroaveraged results.

In [41]:
def evaluate_model(model, X_test, Y_test, category_names):
    """Print classification metrics per category and overall."""
    Y_pred = model.predict(X_test)
    # print accuracy, precision and recall for each category
    for (i, category) in enumerate(category_names):
        print(f'For message category {category}:\n')
        print(classification_report(Y_pred[:, i], Y_test[category]))

    prec, rec, f1, _ = precision_recall_fscore_support(Y_test, Y_pred,
                                                       average='macro')
    fbeta = fbeta_score(Y_test, Y_pred, beta=3, average='macro')
    print()
    print(f'Macro average precision: {prec}')
    print(f'Macro average recall: {rec}')
    print(f'Macro average f1: {f1}')
    print(f'Macro average fbeta, beta=3: {fbeta}')
    
    prec, rec, f1, _ = precision_recall_fscore_support(Y_test, Y_pred,
                                                       average='micro')
    fbeta = fbeta_score(Y_test, Y_pred, beta=3, average='micro')
    print()
    print(f'Micro average precision: {prec}')
    print(f'Micro average recall: {rec}')
    print(f'Micro average f1 score: {f1}')
    print(f'Micro average fbeta, beta=3: {fbeta}')

In [42]:
# evaluate dummy model
evaluate_model(dummy_pipeline, X_test, Y_test, category_names)

For message category related:

              precision    recall  f1-score   support

           0       0.24      0.26      0.25      1235
           1       0.77      0.75      0.76      4009

    accuracy                           0.63      5244
   macro avg       0.50      0.50      0.50      5244
weighted avg       0.64      0.63      0.64      5244

For message category request:

              precision    recall  f1-score   support

           0       0.83      0.82      0.83      4342
           1       0.18      0.19      0.19       902

    accuracy                           0.72      5244
   macro avg       0.51      0.51      0.51      5244
weighted avg       0.72      0.72      0.72      5244

For message category offer:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5224
           1       0.00      0.00      0.00        20

    accuracy                           0.99      5244
   macro avg       0.50      0.50    

In [43]:
# evaluate dummy model
evaluate_model(rf_pipeline, X_test, Y_test, category_names)

For message category related:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.75      0.86      5244

    accuracy                           0.75      5244
   macro avg       0.50      0.37      0.43      5244
weighted avg       1.00      0.75      0.86      5244

For message category request:

              precision    recall  f1-score   support

           0       1.00      0.82      0.90      5242
           1       0.00      1.00      0.00         2

    accuracy                           0.82      5244
   macro avg       0.50      0.91      0.45      5244
weighted avg       1.00      0.82      0.90      5244

For message category offer:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5244
           1       0.00      0.00      0.00         0

    accuracy                           1.00      5244
   macro avg       0.50      0.50    

### Conclusions

The random forest classifier shows considerable improvement over baseline especially with regard to recall.

For this application, it makes sense to favor recall over precision. To see this, recall that precision is

$$ \frac{tp}{tp + fp} $$

where $tp$ is true positives and $fp$ false positives while recall is

$$ \frac{tp}{tp + fn} $$

where $fn$ is false negatives. Thus low precision means a high number of false positives and low recall means a high number of false negatives.

In the context of our classification problem, a false positive is a message wrongly getting labeled in a particular response category (e.g. a message getting categorized as `water` related when it isn't) while a false negative is a message not being properly labeled in the first place (e.g. a `water` related failing to be categorized as such). 

So (to simplify) a false positive leads to a message being routed to the wrong responders, while a false negative leads to the message not being routed in the first place. The cost of the latter is much higher in my view.

## Tune parameters

We'll use gridsearch to tune some of the random forest hyperparameters. For scoring we'll use the micro-averaged [F-beta score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html), which is similar to f1 score but allows to one weight recall higher than precision (or vice versa).

In [96]:
rf_pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x12465c670>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                          criterion='gini', max_depth=None, max_features='auto',
                          max_leaf_nodes=None, max_samples=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                   

In [26]:
def build_model():
    """Build gridseach cv pipeline."""
    # pipeline
    rf_pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize, min_df=2)),
        ('tfidf', TfidfTransformer()),
        ('clf', RandomForestClassifier(class_weight='balanced',
                                       n_jobs=-1))
        ])
    # grid search parameters, only 1 value each for testing
    params = {'clf__class_weight': ['balanced'],
              'clf__max_depth': [3],
              'clf__max_features': ['auto'],
              'clf__min_samples_leaf': [2],
              'clf__n_estimators': [10],
              'clf__n_jobs': [-1],
              'clf__random_state': [27],
              }
    # use fbeta scorer, favoring recall over precision, and taking
    # weighted average over categories to reflect class imbalances
    fbeta_scorer = make_scorer(fbeta_score, beta=3, average='micro')
    # small grid search for tuning
    rf_pipeline_cv = GridSearchCV(rf_pipeline,
                                  params,
                                  scoring=fbeta_scorer,
                                  n_jobs=1,
                                  cv=5,
                                  pre_dispatch='2*n_jobs',
                                  verbose=1)
    return rf_pipeline_cv

In [27]:
model = build_model()
model.fit(X_train, Y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  8.6min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=2,
                                                        ngram_range=(1, 1),
                                                        prep

## Evaluate tuned model

In [28]:
model.best_score_

0.4907171247997345

In [29]:
model.best_params_

{'clf__class_weight': 'balanced',
 'clf__max_depth': 3,
 'clf__max_features': 'auto',
 'clf__min_samples_leaf': 2,
 'clf__n_estimators': 10,
 'clf__n_jobs': -1,
 'clf__random_state': 27}

In [30]:
model.best_estimator_

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=2,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight='balanced',
                                        criterion='gini', max_depth=3,
                                        max_features='auto',
                                        max_leaf_nodes=None, max_sa

In [85]:
evaluate_model(model, X_test, Y_test, category_names)

For message category related:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         0
           1       1.00      0.75      0.86      5244

    accuracy                           0.75      5244
   macro avg       0.50      0.37      0.43      5244
weighted avg       1.00      0.75      0.86      5244

For message category request:

              precision    recall  f1-score   support

           0       1.00      0.82      0.90      5235
           1       0.00      0.22      0.00         9

    accuracy                           0.82      5244
   macro avg       0.50      0.52      0.45      5244
weighted avg       1.00      0.82      0.90      5244

For message category offer:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5244
           1       0.00      0.00      0.00         0

    accuracy                           1.00      5244
   macro avg       0.50      0.50    

## Model Improvements

After some hyperparameter tuning, there is some improvement over the [baseline random forest model](#Evaluate-baseline-models), evaluation metrics are still relatively poor. More hyperparameter tuning could be beneficial, although in this situation its computationally intensive, so we'll look for some other methods of model improvement.

Since the target categories are highly imbalanced, we'll try for a more accurate weighting. We'll also try a multioutput classifier wrapper to fit separate random forest classifiers for each category.

In [95]:
def get_inverse_class_weights(Y):
    """Calculate inverse class weights directly from dataset."""
    class_weights = []
    n_samples = Y.shape[0]
    for category in Y.columns:
        val_weights = n_samples/Y[category].value_counts()
        weights = {val:weight for (val, weight) in 
                   zip(val_weights.index.values, val_weights.values)}
        class_weights += [weights]
    return class_weights

In [96]:
# random forest classifier with inverse class weights directly derived from data
class_weights = get_class_weights(Y)
rf_pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(class_weight=class_weights,
                                   n_jobs=-1, random_state=random_state))
])
rf_pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                                                      {0: 0.9177982911199267,
                                                       1: 0.08220170888007323}, ...],
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                    

In [97]:
evaluate_model(rf_pipeline, X_test, Y_test, category_names)

For message category related:

              precision    recall  f1-score   support

           0       0.39      0.72      0.50       702
           1       0.95      0.82      0.88      4542

    accuracy                           0.81      5244
   macro avg       0.67      0.77      0.69      5244
weighted avg       0.87      0.81      0.83      5244

For message category request:

              precision    recall  f1-score   support

           0       0.99      0.88      0.93      4816
           1       0.39      0.86      0.54       428

    accuracy                           0.88      5244
   macro avg       0.69      0.87      0.74      5244
weighted avg       0.94      0.88      0.90      5244

For message category offer:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5244
           1       0.00      0.00      0.00         0

    accuracy                           1.00      5244
   macro avg       0.50      0.50    

In [93]:
# multioutput classifier with balanced class weights
rf_pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(class_weight='balanced',
                                   n_jobs=-1, random_state=random_state),
                                 n_jobs=-1))
])
rf_pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                                                                        ccp_alpha=0.0,
                                                                        class_weight='balanced',
                                                                        criterion='gini',
                                                             

In [87]:
evaluate_model(rf_pipeline, X_test, Y_test, category_names)

For message category related:

              precision    recall  f1-score   support

           0       0.52      0.72      0.60       951
           1       0.93      0.85      0.89      4293

    accuracy                           0.83      5244
   macro avg       0.72      0.78      0.74      5244
weighted avg       0.86      0.83      0.84      5244

For message category request:

              precision    recall  f1-score   support

           0       0.97      0.91      0.94      4629
           1       0.53      0.81      0.65       615

    accuracy                           0.89      5244
   macro avg       0.75      0.86      0.79      5244
weighted avg       0.92      0.89      0.90      5244

For message category offer:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5244
           1       0.00      0.00      0.00         0

    accuracy                           1.00      5244
   macro avg       0.50      0.50    

The multioutput pipeline appears to have the best performance thus far, as judged by macro scores. We'll optimize hyperparameters of this model with grid search in the final training script `train_classifier.py`.

The class weights we used above seemed to improve the random forest pipeline although it isn't straightforward input them to the multioutput pipeline. This is one possible direction for further model improvement. Another is to find a way to tune the individual classifiers in the multioutput pipeline separately -- currently they all share the same hyperparameter values.

## Pickle model

In [None]:
def save_model(model, model_filepath):
    """Pickle model to filepath."""
    with open(model_filepath, 'wb') as model_file:
        pkl.dump(model, model_file)