# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [53]:
# import libraries
import os
import re
import numpy as np
import pandas as pd
import datetime
import pickle
from sqlalchemy import create_engine

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, make_scorer, classification_report
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.neighbors import KNeighborsClassifier

#### Load data from database

In [2]:
engine = create_engine('sqlite:///../../DisasterResponse.db')
df = pd.read_sql('messages', con=engine)
X = df[['message']]
Y = df.iloc[:,4:]

### 2. Write a tokenization function to process your text data

In [3]:
def tokenize(text):
    """Tokenize the provided text by
        * removing punctuation
        * tokenizing into words
        * removing stopwords
        * lemmatizing, including conversion to lowercase
    """
    
    # remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    # tokenize into words
    tokens = word_tokenize(text)
    
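    # remove stopwords (build the stopword set once per call, not once per token;
    # the comparison is case-sensitive, so capitalized stopwords such as "The" are kept)
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]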
    
    # lemmatize, including conversion to lowercase
    lemmatizer = WordNetLemmatizer()
    clean_tokens = [lemmatizer.lemmatize(tok).lower().strip() for tok in tokens]
    
    return clean_tokens
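
The order of the steps in `tokenize` matters: stopwords are filtered before the tokens are lowercased. The following is a minimal, dependency-free sketch of that behaviour, not the notebook's tokenizer. `simple_tokenize` and the tiny `STOP_WORDS` set are hypothetical stand-ins for `word_tokenize` and `stopwords.words("english")`, and lemmatization is left out.

```python
import re

# Hypothetical mini stop list, standing in for nltk's stopwords.words("english")
STOP_WORDS = {"the", "is", "a", "and", "we", "in"}

def simple_tokenize(text):
    # replace anything that is not a letter or digit with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # a whitespace split stands in for word_tokenize
    tokens = text.split()
    # case-sensitive stopword filter first, lowercasing second (same order as tokenize)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [t.lower().strip() for t in tokens]

result = simple_tokenize("We need water, food and a tent!")
print(result)  # ['we', 'need', 'water', 'food', 'tent']
```

The capitalized "We" survives the filter because the stop list is lowercase. Lowercasing before filtering would remove it, at the cost of changing the features the models above were trained on.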

### 2.A Test individual pipeline steps

Perform train test split

In [4]:
n_samples = X.shape[0]
X_train, X_test, Y_train, Y_test = train_test_split(X.message.iloc[:n_samples], Y.iloc[:n_samples], test_size=0.25)
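
The split above does not set `random_state`, so every run produces a different partition and slightly different scores. A small sketch of a reproducible split on toy data follows. The seed `42` and the toy arrays are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the message column and the category matrix
messages = np.array([f"message {i}" for i in range(8)])
labels = np.arange(16).reshape(8, 2) % 2

# Two calls with the same random_state give the same partition
split_a = train_test_split(messages, labels, test_size=0.25, random_state=42)
split_b = train_test_split(messages, labels, test_size=0.25, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)                        # True
print(len(split_a[0]), len(split_a[1]))  # 6 2
```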

#### Count Vectorizer

In [5]:
vect = CountVectorizer(tokenizer=tokenize)
%time X_train_vect  = vect.fit_transform(X_train)

Wall time: 3min 59s


#### Tf-idf Transformer

In [6]:
tfidf = TfidfTransformer()
%time X_train_tfidf = tfidf.fit_transform(X_train_vect)

Wall time: 27 ms
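
The two steps above (`CountVectorizer` followed by `TfidfTransformer`) can also be expressed as a single `TfidfVectorizer`. The sketch below checks the equivalence on three made-up sentences with the default tokenizer. It is an illustration only, and the notebook itself keeps the two-step form.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up example messages
docs = ["we need water and food", "water is needed in the shelter", "send food"]

# Two steps: raw counts, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: both operations in a single estimator
one_step = TfidfVectorizer().fit_transform(docs)

equivalent = np.allclose(two_step.toarray(), one_step.toarray())
print(equivalent)  # True
```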


#### Train Classifier (Random Forest)

In [7]:
clf = MultiOutputClassifier(RandomForestClassifier(), n_jobs=16)
%time clf.fit(X_train_tfidf, Y_train)

Wall time: 1min 1s


MultiOutputClassifier(estimator=RandomForestClassifier(), n_jobs=16)

#### Vectorize test data

In [8]:
%time X_test_vect  = vect.transform(X_test)

Wall time: 1min 15s


#### Transform test data

In [9]:
%time X_test_tfidf = tfidf.transform(X_test_vect)

Wall time: 5 ms


#### Predict test data

In [10]:
%time Y_pred = clf.predict(X_test_tfidf)

Wall time: 8.47 s


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [114]:
pipeline = Pipeline([
    ('countVectorizer', CountVectorizer(tokenizer=tokenize, max_df=0.5, max_features=1000)),
    ('TFIDF', TfidfTransformer()),
    ('MultiRandomForest', MultiOutputClassifier(RandomForestClassifier(), n_jobs=12))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

#### Perform train test split

In [None]:
n_samples = X.shape[0]
X_train, X_test, Y_train, Y_test = train_test_split(X.message.iloc[:n_samples], Y.iloc[:n_samples], test_size=0.25)

In [None]:
%time pipeline.fit(X_train, Y_train)

In [None]:
%time Y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [None]:
report = classification_report(Y_test, Y_pred, target_names=Y_test.columns, zero_division=0)
print(report)
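
The cell above calls `classification_report` once on the whole label matrix. The per-column iteration mentioned in the instructions could look like the sketch below. It uses toy data, and the helper name `per_category_report` is hypothetical.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy ground truth and predictions for two categories
Y_true = pd.DataFrame({"request": [1, 0, 1, 0], "water": [0, 0, 1, 1]})
Y_hat = pd.DataFrame({"request": [1, 0, 0, 0], "water": [0, 0, 1, 1]})

def per_category_report(y_true, y_pred):
    """Collect the weighted-average metrics of each column in one table."""
    rows = {}
    for col in y_true.columns:
        rep = classification_report(
            y_true[col], y_pred[col], output_dict=True, zero_division=0
        )
        rows[col] = rep["weighted avg"]
    return pd.DataFrame(rows).T

summary = per_category_report(Y_true, Y_hat)
print(summary[["precision", "recall", "f1-score"]].round(2))
```

Each row then describes one binary problem, which makes weak categories easier to spot than in the aggregated report.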

### 6. Improve your model
Use grid search to find better parameters.
(Searching for optimal parameters for CountVectorizer and RandomForest simultaneously took too long)

#### RandomForest

In [115]:
parameters = {
        'countVectorizer__max_df': (0.25, 0.5, 0.75),
        'countVectorizer__max_features': (500, 1000, 5000),
        'MultiRandomForest__estimator__n_estimators': [50, 100, 200],
        'MultiRandomForest__estimator__min_samples_split': [3, 4, 5],
    }

# Score by f1 score
f1_weighted_score = make_scorer(f1_score, average='weighted', zero_division=0)

cv = GridSearchCV(pipeline, param_grid=parameters, scoring=f1_weighted_score, n_jobs=1, verbose=5, cv=2)
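
The size of this search follows directly from the grid: the number of candidates is the product of the list lengths, and every candidate is fitted once per fold. A quick check of that arithmetic, using only the standard library:

```python
import math

# Same grid as in the cell above
parameters = {
    'countVectorizer__max_df': (0.25, 0.5, 0.75),
    'countVectorizer__max_features': (500, 1000, 5000),
    'MultiRandomForest__estimator__n_estimators': [50, 100, 200],
    'MultiRandomForest__estimator__min_samples_split': [3, 4, 5],
}

n_folds = 2  # cv=2 in the GridSearchCV call
n_candidates = math.prod(len(values) for values in parameters.values())
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 81 162
```

At roughly four minutes per fit, 162 sequential fits add up to many hours, so trimming the grid is the most direct way to shorten the search.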

In [116]:
%time cv.fit(X_train, Y_train)

Fitting 2 folds for each of 81 candidates, totalling 162 fits
[CV] MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=500 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=500, score=0.510, total= 3.9min
[CV] MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=500 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  3.9min remaining:    0.0s


[CV]  MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=500, score=0.495, total= 4.1min
[CV] MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=1000 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  8.0min remaining:    0.0s


[CV]  MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=1000, score=0.504, total= 4.1min
[CV] MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=1000 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 12.1min remaining:    0.0s


[CV]  MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=1000, score=0.496, total= 4.1min
[CV] MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=5000 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 16.3min remaining:    0.0s


[CV]  MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=5000, score=0.494, total= 4.1min
[CV] MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=5000 
[CV]  MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.25, countVectorizer__max_features=5000, score=0.482, total= 4.1min
[CV] MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.5, countVectorizer__max_features=500 
[CV]  MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__estimator__n_estimators=50, countVectorizer__max_df=0.5, countVectorizer__max_features=500, score=0.510, total= 4.2min
[CV] MultiRandomForest__estimator__min_samples_split=3, MultiRandomForest__

[Parallel(n_jobs=1)]: Done 162 out of 162 | elapsed: 707.1min finished


Wall time: 11h 51min 47s


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('countVectorizer',
                                        CountVectorizer(max_df=0.5,
                                                        max_features=1000,
                                                        tokenizer=<function tokenize at 0x000002288BF6D430>)),
                                       ('TFIDF', TfidfTransformer()),
                                       ('MultiRandomForest',
                                        MultiOutputClassifier(estimator=RandomForestClassifier(),
                                                              n_jobs=12))]),
             n_jobs=1,
             param_grid={'MultiRandomForest__estimator__min_samples_split': [3,
                                                                             4,
                                                                             5],
                         'MultiRandomForest__estimator__n_estimators': [50, 100,
                 

In [117]:
cv.cv_results_

{'mean_fit_time': array([118.39256537, 125.98731017, 126.05974543, 125.99281204,
        125.35403657, 126.05767691, 126.36557055, 125.72985053,
        126.15377951, 131.1169883 , 131.21624613, 133.0197022 ,
        132.44889271, 136.58924639, 132.51555538, 131.07609379,
        132.20216656, 133.17715418, 142.90154326, 145.35490394,
        160.9822619 , 146.34479785, 144.52763486, 147.31619978,
        144.32023895, 143.69905245, 147.71126854, 125.93358123,
        125.88600457, 125.54132164, 125.95784736, 125.71651256,
        125.42670035, 126.34057438, 125.06870341, 126.32540357,
        131.13095748, 132.29757607, 131.9879905 , 131.47109687,
        131.52917254, 132.7994287 , 130.93714499, 131.13412821,
        132.95561182, 143.1496489 , 142.51382816, 145.85423803,
        144.08951056, 145.18944931, 146.92610919, 143.49140823,
        143.55804884, 145.7858485 , 125.23681772, 125.50565124,
        126.04740977, 125.98699558, 125.4093523 , 125.44807136,
        125.81736982, 1

In [118]:
best_param_index = np.argmin(cv.cv_results_['rank_test_score'])
print(f"Best param_countVectorizer__max_df: {cv.cv_results_['param_countVectorizer__max_df'][best_param_index]}")
print(f"Best param_countVectorizer__max_features: {cv.cv_results_['param_countVectorizer__max_features'][best_param_index]}")
print(f"Best param_MultiRandomForest__estimator__n_estimators: {cv.cv_results_['param_MultiRandomForest__estimator__n_estimators'][best_param_index]}")
print(f"Best param_MultiRandomForest__estimator__min_samples_split: {cv.cv_results_['param_MultiRandomForest__estimator__min_samples_split'][best_param_index]}")

Best param_countVectorizer__max_df: 0.75
Best param_countVectorizer__max_features: 500
Best param_MultiRandomForest__estimator__n_estimators: 100
Best param_MultiRandomForest__estimator__min_samples_split: 4
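
`GridSearchCV` also exposes the winning configuration directly through `best_params_` and `best_index_`, which should agree with the manual `np.argmin` lookup on `rank_test_score`. The sketch below checks this on a small synthetic problem. The `LogisticRegression` estimator and its `C` grid are stand-ins chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic problem, unrelated to the disaster-response data
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=200), {"C": [0.1, 1.0, 10.0]}, cv=2
)
search.fit(X_toy, y_toy)

# Manual lookup, as in the cell above
manual_index = int(np.argmin(search.cv_results_["rank_test_score"]))

print(search.best_params_)
print(search.cv_results_["params"][manual_index])
```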


In [119]:
outpath = r"../../models"
modeltype = list(cv.best_estimator_.named_steps)[-1]
score   = f"{int(cv.best_score_ * 1e6):06d}"
date    = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
outfilename = os.path.join(outpath, f"{modeltype}_{score}_{date}.model")
with open(outfilename, 'wb') as outfile:
    pickle.dump(cv, outfile)
print(f"Wrote file {outfilename}")

Wrote file ../../models\MultiRandomForest_505588_20201201_104554.model
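
The file name above encodes the model type, the best score scaled to six digits, and a timestamp. The sketch below reproduces that naming scheme and a pickle round trip with the standard library only. `model_filename`, the score `0.5`, and the dictionary standing in for the fitted search object are hypothetical.

```python
import datetime
import os
import pickle
import tempfile

def model_filename(modeltype, best_score, now):
    """Build '<modeltype>_<score>_<timestamp>.model' like the cell above."""
    score = f"{int(best_score * 1e6):06d}"  # int() truncates, it does not round
    date = now.strftime("%Y%m%d_%H%M%S")
    return f"{modeltype}_{score}_{date}.model"

# Hypothetical values, chosen only to make the example deterministic
name = model_filename(
    "MultiRandomForest", 0.5, datetime.datetime(2020, 12, 1, 10, 45, 54)
)
print(name)  # MultiRandomForest_500000_20201201_104554.model

# A plain dict stands in for the fitted GridSearchCV object
stand_in_model = {"max_df": 0.75, "max_features": 500}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, name)
    with open(path, "wb") as outfile:
        pickle.dump(stand_in_model, outfile)
    with open(path, "rb") as infile:
        restored = pickle.load(infile)

print(restored == stand_in_model)  # True
```

When the real pipeline is unpickled, the `tokenize` function must be importable under the same module and name, because pickle stores functions by reference rather than by value.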


#### Complement Naive Bayes

In [69]:
pipeline = Pipeline([
    ('countVectorizer', CountVectorizer(tokenizer=tokenize)),
    ('TFIDF', TfidfTransformer()),
    ('ComplementNB', MultiOutputClassifier(ComplementNB(), n_jobs=14))
])

In [70]:
parameters = {
        'countVectorizer__max_df': (0.25, 0.5, 0.75),
        'countVectorizer__max_features': (500, 1000, 5000),
        'ComplementNB__estimator__alpha': [0, 0.25, 0.5, 0.75, 1.0],
    }

# Score by f1 score
f1_weighted_score = make_scorer(f1_score, average='weighted', zero_division=0)

cv = GridSearchCV(pipeline, param_grid=parameters, scoring=f1_weighted_score, n_jobs=1, verbose=5, cv=2)
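
The scorer used in both searches averages the per-category f1 scores, weighted by the number of positive samples (the support) of each category. A toy check of that definition against `f1_score(..., average='weighted')` follows. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel data: 4 samples, 2 categories
Y_true = np.array([[1, 0], [1, 0], [1, 1], [0, 0]])
Y_hat = np.array([[1, 0], [0, 0], [1, 0], [0, 0]])

# f1 per category, then weight by the number of true positives per category
per_label = f1_score(Y_true, Y_hat, average=None, zero_division=0)
support = Y_true.sum(axis=0)
manual = float((per_label * support).sum() / support.sum())

weighted = float(f1_score(Y_true, Y_hat, average="weighted", zero_division=0))
print(round(manual, 3), round(weighted, 3))  # 0.6 0.6
```

Frequent categories therefore dominate the score, while a rare category that is never predicted barely lowers it.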

In [71]:
%time cv.fit(X_train, Y_train)

Fitting 2 folds for each of 45 candidates, totalling 90 fits
[CV] ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=500 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=500, score=0.445, total= 3.0min
[CV] ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=500 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  3.0min remaining:    0.0s


[CV]  ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=500, score=0.443, total= 3.5min
[CV] ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=1000 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  6.4min remaining:    0.0s


[CV]  ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=1000, score=0.452, total= 4.0min
[CV] ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=1000 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 10.4min remaining:    0.0s


[CV]  ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=1000, score=0.453, total= 4.0min
[CV] ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=5000 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 14.4min remaining:    0.0s


[CV]  ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=5000, score=0.430, total= 4.0min
[CV] ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=5000 
[CV]  ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.25, countVectorizer__max_features=5000, score=0.426, total= 4.0min
[CV] ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.5, countVectorizer__max_features=500 
[CV]  ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.5, countVectorizer__max_features=500, score=0.445, total= 4.0min
[CV] ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.5, countVectorizer__max_features=500 
[CV]  ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.5, countVectorizer__max_features=500, score=0.443, total= 4.0min
[CV] ComplementNB__estimator__alpha=0, countVectorizer__max_df=0.5, countVectorizer__max_features=1000 
[CV]  ComplementNB__estimator__alpha=0, countVectorizer__

[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed: 353.7min finished


Wall time: 5h 57min 39s


GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('countVectorizer',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000002288BF6D430>)),
                                       ('TFIDF', TfidfTransformer()),
                                       ('ComplementNB',
                                        MultiOutputClassifier(estimator=ComplementNB(),
                                                              n_jobs=14))]),
             n_jobs=1,
             param_grid={'ComplementNB__estimator__alpha': [0, 0.25, 0.5, 0.75,
                                                            1.0],
                         'countVectorizer__max_df': (0.25, 0.5, 0.75),
                         'countVectorizer__max_features': (500, 1000, 5000)},
             scoring=make_scorer(f1_score, average=weighted, zero_division=0),
             verbose=5)

In [72]:
cv.cv_results_

{'mean_fit_time': array([ 89.50700867, 119.4877404 , 119.87895346, 119.52050042,
        118.75770736, 118.65534925, 119.46136558, 118.96802056,
        119.46336019, 118.88865781, 119.23923147, 120.29551971,
        118.61435914, 118.68232822, 118.1525141 , 118.50276959,
        118.09186554, 118.25676048, 118.68215907, 118.01186311,
        118.18677056, 118.27757692, 118.1262157 , 118.38759995,
        118.15856576, 118.23968649, 118.82146788, 119.10368431,
        119.01632524, 118.41697466, 119.04488897, 117.86802018,
        118.92249215, 119.2589761 , 118.3115592 , 119.42037272,
        118.14375448, 118.65193927, 117.43929482, 117.16972113,
        117.29077709, 117.23652637, 116.55385864, 116.42628074,
        116.65570962]),
 'std_fit_time': array([1.0734762 , 0.88614357, 0.66457677, 0.17054272, 0.67632866,
        0.84413433, 0.01173007, 0.57982624, 1.25282013, 0.99604416,
        0.19750845, 1.06358087, 0.64857841, 1.47726297, 0.58128679,
        0.30657661, 0.44742846, 0.9

In [81]:
best_param_index = np.argmin(cv.cv_results_['rank_test_score'])
print(f"Best param_countVectorizer__max_df: {cv.cv_results_['param_countVectorizer__max_df'][best_param_index]}")
print(f"Best param_countVectorizer__max_features: {cv.cv_results_['param_countVectorizer__max_features'][best_param_index]}")
print(f"Best param_ComplementNB__estimator__alpha: {cv.cv_results_['param_ComplementNB__estimator__alpha'][best_param_index]}")

Best param_countVectorizer__max_df: 0.25
Best param_countVectorizer__max_features: 5000
Best param_ComplementNB__estimator__alpha: 0.25


In [73]:
outpath = r"../../models"
modeltype = list(cv.best_estimator_.named_steps)[-1]
score   = f"{int(cv.best_score_ * 1e6):06d}"
date    = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
outfilename = os.path.join(outpath, f"{modeltype}_{score}_{date}.model")
with open(outfilename, 'wb') as outfile:
    pickle.dump(cv, outfile)
print(f"Wrote file {outfilename}")

Wrote file ../../models\ComplementNB_463106_20201128_041450.model


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [51]:
print(f"Model accuracy: {accuracy_score(Y_test, Y_pred):.2f}")

Model accuracy: 0.42
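
For multilabel targets, `accuracy_score` reports subset accuracy: a message counts as correct only if all of its categories are predicted correctly. This is much stricter than the per-label hit rate, as the toy comparison below illustrates with made-up arrays.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy multilabel data: 4 samples, 3 categories
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 0]])
Y_hat = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 0]])

# Subset accuracy: only rows 0 and 3 match in every category
subset_accuracy = accuracy_score(Y_true, Y_hat)

# Per-label accuracy: 10 of the 12 individual labels are correct
per_label_accuracy = float((Y_true == Y_hat).mean())

print(subset_accuracy)               # 0.5
print(round(per_label_accuracy, 3))  # 0.833
```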


In [52]:
report = classification_report(Y_test, Y_pred, target_names=Y_test.columns, zero_division=0)
print(report)

                        precision    recall  f1-score   support

               request       0.82      0.50      0.62      1073
                 offer       0.00      0.00      0.00        26
           aid_related       0.76      0.68      0.72      2658
          medical_help       0.64      0.09      0.16       505
      medical_products       0.81      0.08      0.15       308
     search_and_rescue       0.86      0.04      0.07       166
              security       1.00      0.01      0.01       136
              military       0.73      0.05      0.09       234
           child_alone       0.00      0.00      0.00         0
                 water       0.88      0.36      0.51       413
                  food       0.83      0.55      0.66       712
               shelter       0.81      0.41      0.54       581
              clothing       1.00      0.06      0.11       102
                 money       1.00      0.03      0.06       152
        missing_people       0.00      

Check prediction probabilities for individual samples

In [49]:
sample = 3
Y_pred = pd.DataFrame(Y_pred, columns=Y_test.columns, index=X_test.index)
pp = clf.predict_proba(X_test_tfidf[sample])
pred_proba = pd.DataFrame(Y_pred.iloc[sample])
pred_proba.columns  = ["prediction"]
pred_proba["truth"] = Y_test.iloc[sample]
pred_proba["proba_0"] = [a[0][0] for a in pp]
pred_proba["proba_1"] = [a[0][1] if len(a[0])>1 else 0.0 for a in pp]
print(X_test.iloc[sample])
pred_proba.style.applymap(lambda v: 'color: red' if v < 0.5 else 'color: green', subset=["proba_0", "proba_1"])

Needs and requirements: metal caskets/coffins, medical supplies, medicines, rubber gloves, masks, tents and office supplies, blankets, clothes, heavy equipment, power generators, water purifiers, forensic pathologists, freezer containers for bodies, communications equipment, dried or canned food and rice.


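Category          | prediction | truth | proba_0 | proba_1
------------------|------------|-------|---------|--------
request           |          0 |     1 |     0.8 |     0.2
offer             |          0 |     0 |     1.0 |     0.0
aid_related       |          1 |     1 |    0.09 |    0.91
medical_help      |          0 |     1 |    0.81 |    0.19
medical_products  |          0 |     1 |    0.61 |    0.39
search_and_rescue |          0 |     0 |    0.99 |    0.01
security          |          0 |     0 |    0.97 |    0.03
military          |          0 |     0 |    0.99 |    0.01
child_alone       |          0 |     0 |     1.0 |     0.0
water             |          0 |     1 |    0.61 |    0.39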
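
`MultiOutputClassifier.predict_proba` returns a list with one probability array per category. A category that contains only one class in the training data (such as `child_alone`, with a support of 0) yields an array with a single column, which is why the cell above guards the `proba_1` lookup. The sketch below shows this on toy data. All names and values are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Deterministic toy features: 20 samples, 3 columns
X_toy = np.arange(60).reshape(20, 3) / 60.0
Y_toy = np.column_stack([
    (X_toy[:, 0] > 0.5).astype(int),  # category with both classes
    np.zeros(20, dtype=int),          # category with only class 0
])

clf_toy = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
).fit(X_toy, Y_toy)

# One array per category, each of shape (n_samples, n_classes_seen)
proba = clf_toy.predict_proba(X_toy[:1])
shapes = [p.shape for p in proba]
print(shapes)  # [(1, 2), (1, 1)]

# Same guard as in the cell above: no second column means P(class 1) = 0
proba_1 = [float(p[0][1]) if p.shape[1] > 1 else 0.0 for p in proba]
```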


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
    * Naive Bayes (Multinomial, Complement)
    * Nearest Neighbours (KNeighborsClassifier)

Model         | precision  |  recall | f1-score | Further parameters
--------------|------------|---------|----------|------------------------------
RandomForest  |      0.70  |    0.45 |     0.51 | max_df=0.5, max_features=1000
ComplementNB  |      0.47  |    0.36 |     0.39 | defaults
ComplementNB  |      0.36  |    0.77 |     0.46 | max_df=0.5, max_features=1000
MultinomialNB |      0.70  |    0.32 |     0.39 | max_df=0.5, max_features=1000
KNeighbors    |      0.62  |    0.12 |     0.20 | max_df=0.5, max_features=1000
KNeighbors    |            |         |     0.11 | n_neighbors=10
KNeighbors    |            |         |     0.08 | n_neighbors=20
KNeighbors    |            |         |     0.03 | n_neighbors=50
KNeighbors    |            |         |     0.01 | n_neighbors=100

**Conclusion:**
Use Random Forest for the best f1-score and precision, and ComplementNB for the best recall.

#### Nearest Neighbours (KNeighborsClassifier)

In [None]:
pipeline = Pipeline([
    ('countVectorizer', CountVectorizer(tokenizer=tokenize, max_df=0.5, max_features=1000)),
    ('TFIDF', TfidfTransformer()),
    # KNeighborsClassifier matches the n_neighbors grid searched below
    ('classifier', MultiOutputClassifier(KNeighborsClassifier()))
])

In [None]:
%time pipeline.fit(X_train, Y_train)

In [None]:
%time Y_pred = pipeline.predict(X_test)

In [None]:
print(classification_report(Y_test, Y_pred, target_names=Y_test.columns, zero_division=0))

In [None]:
parameters = {
        'classifier__estimator__n_neighbors': [10, 20, 50, 100],
    }

# Score by f1 score
f1_weighted_score = make_scorer(f1_score, average='weighted', zero_division=0)

cv = GridSearchCV(pipeline, param_grid=parameters, scoring=f1_weighted_score, n_jobs=1, verbose=1, cv=2)

In [None]:
%time cv.fit(X_train, Y_train)

In [None]:
cv.cv_results_

### 9. Export your model as a pickle file
With the results from above, the best performing models are (now trained with the full dataset):

#### RandomForest
**highest f1-score** (0.51), **highest precision** (0.71), sufficient recall (0.44), sufficient accuracy (0.41) [weighted averages]
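* Random Forest with the settings below (the vectorizer values match the ComplementNB search result; the Random Forest search above selected max_df=0.75, max_features=500)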
    * CountVectorizer max_df=0.25
    * CountVectorizer max_features=5000
    * RandomForest n_estimators=100
    * RandomForest min_samples_split=4

In [87]:
RandomForest_pipeline = Pipeline([
    ('countVectorizer', CountVectorizer(tokenizer=tokenize, max_df=0.25, max_features=5000)),
    ('TFIDF', TfidfTransformer()),
    ('MultiRandomForest', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, min_samples_split=4), n_jobs=14))
])

In [88]:
%time RandomForest_pipeline.fit(X_train, Y_train)

Wall time: 3min 49s


Pipeline(steps=[('countVectorizer',
                 CountVectorizer(max_df=0.25, max_features=5000,
                                 tokenizer=<function tokenize at 0x000002288BF6D430>)),
                ('TFIDF', TfidfTransformer()),
                ('MultiRandomForest',
                 MultiOutputClassifier(estimator=RandomForestClassifier(min_samples_split=4),
                                       n_jobs=14))])

In [89]:
Y_pred = RandomForest_pipeline.predict(X_test)
print(f"Model accuracy: {accuracy_score(Y_test, Y_pred):.2f}")
print(classification_report(Y_test, Y_pred, target_names=Y_test.columns, zero_division=0))

Model accuracy: 0.41
                        precision    recall  f1-score   support

               request       0.81      0.51      0.63      1073
                 offer       0.00      0.00      0.00        26
           aid_related       0.73      0.71      0.72      2658
          medical_help       0.56      0.13      0.21       505
      medical_products       0.82      0.17      0.28       308
     search_and_rescue       0.59      0.10      0.17       166
              security       0.00      0.00      0.00       136
              military       0.74      0.12      0.21       234
           child_alone       0.00      0.00      0.00         0
                 water       0.86      0.49      0.62       413
                  food       0.82      0.70      0.76       712
               shelter       0.80      0.50      0.61       581
              clothing       0.82      0.18      0.29       102
                 money       0.80      0.03      0.05       152
        missing_pe

Refit on the complete dataset and pickle

In [112]:
%time RandomForest_pipeline.fit(X["message"], Y)

Wall time: 6min 13s


Pipeline(steps=[('countVectorizer',
                 CountVectorizer(max_df=0.25, max_features=5000,
                                 tokenizer=<function tokenize at 0x000002288BF6D430>)),
                ('TFIDF', TfidfTransformer()),
                ('MultiRandomForest',
                 MultiOutputClassifier(estimator=RandomForestClassifier(min_samples_split=4),
                                       n_jobs=14))])

In [113]:
outpath = r"../../models"
outfilename = os.path.join(outpath, "RandomForest.model")
with open(outfilename, 'wb') as outfile:
    pickle.dump(RandomForest_pipeline, outfile)
print(f"Wrote file {outfilename}")

Wrote file ../../models\RandomForest.model


#### Complement Naive Bayes
**good f1-score** (0.46), sufficient precision (0.36), **highest recall** (0.77), low exact-match accuracy (0.08) [weighted averages]
* Complement Naive Bayes with
    * CountVectorizer max_df=0.25
    * CountVectorizer max_features=5000
    * ComplementNB alpha=0.25

In [100]:
ComplementNB_pipeline = Pipeline([
    ('countVectorizer', CountVectorizer(tokenizer=tokenize, max_df=0.25, max_features=5000)),
    ('TFIDF', TfidfTransformer()),
    ('ComplementNB', MultiOutputClassifier(ComplementNB(alpha=0.25), n_jobs=14))
])

In [93]:
%time ComplementNB_pipeline.fit(X_train, Y_train)

Wall time: 3min 3s


Pipeline(steps=[('countVectorizer',
                 CountVectorizer(max_df=0.25, max_features=5000,
                                 tokenizer=<function tokenize at 0x000002288BF6D430>)),
                ('TFIDF', TfidfTransformer()),
                ('ComplementNB',
                 MultiOutputClassifier(estimator=ComplementNB(alpha=0.25),
                                       n_jobs=14))])

In [94]:
Y_pred = ComplementNB_pipeline.predict(X_test)
print(f"Model accuracy: {accuracy_score(Y_test, Y_pred):.2f}")
print(classification_report(Y_test, Y_pred, target_names=Y_test.columns, zero_division=0))

Model accuracy: 0.08
                        precision    recall  f1-score   support

               request       0.46      0.82      0.59      1073
                 offer       0.01      0.35      0.02        26
           aid_related       0.65      0.75      0.70      2658
          medical_help       0.22      0.67      0.33       505
      medical_products       0.17      0.75      0.28       308
     search_and_rescue       0.07      0.50      0.12       166
              security       0.05      0.41      0.09       136
              military       0.17      0.80      0.28       234
           child_alone       0.00      0.00      0.00         0
                 water       0.23      0.83      0.36       413
                  food       0.36      0.82      0.50       712
               shelter       0.28      0.81      0.42       581
              clothing       0.10      0.74      0.18       102
                 money       0.10      0.66      0.17       152
        missing_pe

Refit on the complete dataset and pickle

In [110]:
%time ComplementNB_pipeline.fit(X["message"], Y)

Wall time: 5min 14s


Pipeline(steps=[('countVectorizer',
                 CountVectorizer(max_df=0.25, max_features=5000,
                                 tokenizer=<function tokenize at 0x000002288BF6D430>)),
                ('TFIDF', TfidfTransformer()),
                ('ComplementNB',
                 MultiOutputClassifier(estimator=ComplementNB(alpha=0.25),
                                       n_jobs=14))])

In [111]:
outpath = r"../../models"
outfilename = os.path.join(outpath, "ComplementNB.model")
with open(outfilename, 'wb') as outfile:
    pickle.dump(ComplementNB_pipeline, outfile)
print(f"Wrote file {outfilename}")

Wrote file ../../models\ComplementNB.model


### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
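
The template file is not shown in this notebook, so the following is only a guess at how such a script could accept its inputs. The function name `parse_args` and the argument names `database_filepath` and `model_filepath` are hypothetical, and the paths are the ones used in the cells above.

```python
import argparse

def parse_args(argv=None):
    """Read the database location and the output path from the command line."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and pickle it."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)

# Example call with the paths used in this notebook
args = parse_args(["../../DisasterResponse.db", "../../models/RandomForest.model"])
print(args.database_filepath)
print(args.model_filepath)
```

The loading, training, and pickling steps from the sections above would then run with these two values instead of hard-coded paths.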