# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [48]:
# import libraries
from sqlalchemy import create_engine
import os
import pandas as pd
import re
import numpy as np
import timeit
from pickle import dump

# NLP imports
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

#scikit learn imports
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, ParameterGrid
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score
from sklearn.linear_model import LogisticRegression, SGDClassifier

In [49]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')

# see table names to verify correct import to db
engine.table_names()

['disaster_resp']
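Note that `engine.table_names()` is deprecated in newer SQLAlchemy releases. As a hedged alternative, the same sanity check can be sketched with the standard-library `sqlite3` module. The in-memory database and the three-column schema below are illustrative assumptions, not the real layout of `DisasterResponse.db`.

```python
import sqlite3

# Hypothetical stand-in for DisasterResponse.db: an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE disaster_resp (id INTEGER, message TEXT, related INTEGER)")
conn.execute("INSERT INTO disaster_resp VALUES (1, 'We need water', 1)")
conn.commit()

# sqlite_master is SQLite's built-in catalog of tables and indexes
rows = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
table_names = [name for (name,) in rows]
print(table_names)
conn.close()
```

For the real file, replacing `":memory:"` with the path to the database should list the same table name shown above.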

In [50]:
# create dataframe using table in db
df = pd.read_sql_table('disaster_resp',engine) 

In [51]:
# split into dependent and independent variables
X = df['message']

# target variable is all columns that have category data
Y = df.iloc[:,4:]

### 2. Write a tokenization function to process your text data

In [52]:
def tokenize(text):
    # normalize: lowercase and replace non-alphanumeric characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # split the text into word tokens
    tokens = word_tokenize(text)
    
    # remove stop words (build the set once instead of reloading the list per token)
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]
    
    # lemmatize words
    lemmatizer = WordNetLemmatizer()
    clean_tokens = [lemmatizer.lemmatize(tok).strip() for tok in tokens]
    
    return clean_tokens
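To see what the normalization step does before NLTK is involved, here is a minimal standard-library sketch. It reuses the regular expression from `tokenize`, but it splits on whitespace instead of calling `word_tokenize`, uses a tiny hand-written stop-word set instead of the NLTK list, and skips lemmatization. `simple_tokenize` and `SAMPLE_STOPWORDS` are hypothetical names used only for illustration.

```python
import re

# Assumption: a tiny illustrative stop-word set, not the NLTK English list
SAMPLE_STOPWORDS = {"we", "the", "and", "in", "a", "is"}

def simple_tokenize(text):
    # lowercase, then replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # a whitespace split stands in for nltk's word_tokenize
    return [tok for tok in text.split() if tok not in SAMPLE_STOPWORDS]

tokens = simple_tokenize("We need WATER, food & shelter in Port-au-Prince!")
print(tokens)
# → ['need', 'water', 'food', 'shelter', 'port', 'au', 'prince']
```

Punctuation such as the hyphens in "Port-au-Prince" becomes a token boundary, which is worth keeping in mind for place names.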

In [53]:
# The tokenizer needs the NLTK punkt, stopwords, and wordnet resources.
# Uncomment below to download them if needed

#import nltk
#nltk.download(['punkt', 'stopwords', 'wordnet'])

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
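Conceptually, `MultiOutputClassifier` fits one independent copy of the base estimator per target column and stacks the per-column predictions. The pure-Python sketch below illustrates that idea with a trivial majority-class estimator. `MajorityClassifier` and `PerColumnClassifier` are hypothetical stand-ins, not scikit-learn's implementation, and the messages and labels are made up.

```python
from collections import Counter

class MajorityClassifier:
    """Toy estimator: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

class PerColumnClassifier:
    """Fits one estimator per target column (the multi-output idea)."""
    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        columns = list(zip(*Y))  # transpose rows into columns
        self.estimators_ = [self.make_estimator().fit(X, col) for col in columns]
        return self

    def predict(self, X):
        per_column = [est.predict(X) for est in self.estimators_]
        return [list(row) for row in zip(*per_column)]  # back to one row per message

X = ["need water", "road blocked", "send food", "thank you"]
Y = [[1, 1], [1, 0], [1, 0], [0, 0]]  # two hypothetical category columns
model = PerColumnClassifier(MajorityClassifier).fit(X, Y)
predictions = model.predict(["new message"])
print(predictions)
# → [[1, 0]]
```

Because each column gets its own estimator, a rare category is learned in isolation, which is one reason per-category scores can differ so much.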

In [54]:
# From the ETL step, at least one category column contains only a single label.
# RandomForestClassifier is used because it can fit a target with a single class.
# To train on all processors, pass n_jobs=-1 to MultiOutputClassifier (not set here).

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [55]:
# split into training and test data (70% / 30%)
# set random state for reproducibility

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = .30, random_state = 1)

In [56]:
# train classifier using training data
# warnings are suppressed here; comment out the next line to see them
import warnings; warnings.simplefilter('ignore')
start = timeit.default_timer()

pipeline.fit(X_train, y_train)

stop = timeit.default_timer()
print('Training time (in seconds): ', stop - start)


Training time (in seconds):  233.68115570000373


In [57]:
# predict on test data
Y_pred = pipeline.predict(X_test)

F1 scores should be used to assess the models, because F1 is a more appropriate metric than accuracy for imbalanced data, which is the case with this dataset.
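A short worked example of why accuracy is misleading here. It uses the test-set class counts that appear in the `offer` report later in this notebook (7832 negatives, 33 positives) and assumes, for illustration, a classifier that predicts 0 for every message. `precision_recall_f1` is a hypothetical helper, not a scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    # each ratio is defined as 0.0 when its denominator is zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# 'offer' counts; assumption: every message is predicted as 0
tn, fp, fn, tp = 7832, 0, 33, 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(round(accuracy, 4), precision, recall, f1)
# → 0.9958 0.0 0.0 0.0
```

Accuracy is above 99% even though not a single positive message is found, while F1 for the positive class is 0.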

`f1_score` offers four averaging options for a multilabel dataset like this one:
 1. __Micro__: Calculate metrics globally by counting the total true positives, false negatives and false positives.

 2. __Macro__: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

 3. __Weighted__: Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

 4. __Samples__: Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).
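To make the four averages concrete, here is a pure-Python sketch on a made-up multilabel example with 4 messages and 3 labels. It uses the identity F1 = 2·TP / (2·TP + FP + FN) and returns 0 when a label or sample has no true or predicted positives. The helper names and the toy data are assumptions for illustration; in the notebook itself `sklearn.metrics.f1_score` does this work.

```python
def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def counts(true, pred):
    # true positives, false positives, false negatives for one 0/1 vector pair
    tp = sum(t == 1 and p == 1 for t, p in zip(true, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(true, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(true, pred))
    return tp, fp, fn

# rows are messages, columns are labels (toy data)
y_true = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 1]]

# per-label counts: compare column by column
label_counts = [counts(t, p) for t, p in zip(zip(*y_true), zip(*y_pred))]
label_f1 = [f1_from_counts(*c) for c in label_counts]
support = [sum(col) for col in zip(*y_true)]

# micro: pool the counts over all labels, then compute one F1
total_tp, total_fp, total_fn = (sum(c[i] for c in label_counts) for i in range(3))
micro = f1_from_counts(total_tp, total_fp, total_fn)
# macro: unweighted mean of per-label F1
macro = sum(label_f1) / len(label_f1)
# weighted: mean of per-label F1 weighted by support
weighted = sum(f * s for f, s in zip(label_f1, support)) / sum(support)
# samples: F1 per message (row by row), then averaged
samples = sum(f1_from_counts(*counts(t, p)) for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(micro, 3), round(macro, 3), round(weighted, 3), round(samples, 3))
# → 0.727 0.508 0.651 0.75
```

The second label is never predicted, so its F1 of 0 pulls the macro average well below the micro average. The same pattern shows up in this notebook's scores, where rare categories drag macro F1 down.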

In [58]:
# get F1 scores for this model since the data is imbalanced
print('Micro F1 Score: ' + str(f1_score(y_test, Y_pred, average = 'micro')))
print('Macro F1 Score: ' + str(f1_score(y_test, Y_pred, average = 'macro')))
print('Weighted F1 Score: ' + str(f1_score(y_test, Y_pred, average = 'weighted')))
print('Samples F1 Score: ' + str(f1_score(y_test, Y_pred, average = 'samples')))

Micro F1 Score: 0.5944503197864733
Macro F1 Score: 0.23056060745281387
Weighted F1 Score: 0.5286867606863018
Samples F1 Score: 0.4647858210861778


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [59]:
# score the model and show the classification report for each category column
# the data is imbalanced, so F1 is the main metric,
# but the other metrics are still worth inspecting

column_list = list(Y.columns)
for idx, column in enumerate(column_list):
    y_true = y_test.to_numpy()[:, idx]
    y_pred = Y_pred[:, idx]
    print(column + ':')
    print(classification_report(y_true, y_pred) + '\n')

related:
              precision    recall  f1-score   support

           0       0.65      0.46      0.54      1928
           1       0.84      0.92      0.88      5937

    accuracy                           0.81      7865
   macro avg       0.75      0.69      0.71      7865
weighted avg       0.79      0.81      0.80      7865


request:
              precision    recall  f1-score   support

           0       0.90      0.97      0.93      6511
           1       0.77      0.47      0.58      1354

    accuracy                           0.88      7865
   macro avg       0.83      0.72      0.76      7865
weighted avg       0.88      0.88      0.87      7865


offer:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7832
           1       0.00      0.00      0.00        33

    accuracy                           1.00      7865
   macro avg       0.50      0.50      0.50      7865
weighted avg       0.99      1.00      0.99    

### 6. Improve your model
Use grid search to find better parameters. 
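As a rough picture of what grid search does, the sketch below evaluates every combination in a parameter grid and keeps the best one. `grid_search` and `toy_score` are hypothetical; `toy_score` is an arbitrary stand-in for the cross-validated score that `GridSearchCV` would compute by actually fitting the pipeline.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Assumption: an arbitrary scoring rule, not a real model evaluation
def toy_score(params):
    return params["n_estimators"] / 10 - (0.2 if params["bootstrap"] else 0.0)

grid = {"bootstrap": [True, False], "n_estimators": [5, 10]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
# → {'bootstrap': False, 'n_estimators': 10} 1.0
```

The cost grows with the product of the list lengths, which is why only a few parameters are searched at a time in the cells that follow.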

In [60]:
# to see all estimators, use the following:
#pipeline.get_params().keys()

# tidy the list to show only the parameters for the classifier step
# iterate through the dictionary keys and keep those whose names
# start with the 'clf_' prefix
[i for i in list(pipeline.get_params().keys()) if i.startswith('clf_')]

['clf__estimator__bootstrap',
 'clf__estimator__class_weight',
 'clf__estimator__criterion',
 'clf__estimator__max_depth',
 'clf__estimator__max_features',
 'clf__estimator__max_leaf_nodes',
 'clf__estimator__min_impurity_decrease',
 'clf__estimator__min_impurity_split',
 'clf__estimator__min_samples_leaf',
 'clf__estimator__min_samples_split',
 'clf__estimator__min_weight_fraction_leaf',
 'clf__estimator__n_estimators',
 'clf__estimator__n_jobs',
 'clf__estimator__oob_score',
 'clf__estimator__random_state',
 'clf__estimator__verbose',
 'clf__estimator__warm_start',
 'clf__estimator',
 'clf__n_jobs']

In [61]:
# make set of parameter values to test in GridSearchCV
params = {  'clf__estimator__bootstrap': [True, False],
            'clf__estimator__n_estimators': [5,10]#, 
            #'clf__estimator__max_features': ['log2', 'sqrt','auto'],
            #'clf__estimator__criterion': ['entropy', 'gini'], 
            #'clf__estimator__max_depth': [2, 3, 5, 10], 
            #'clf__estimator__min_samples_split': [2, 3, 5],
            #'clf__estimator__min_samples_leaf': [1,5,8] 
         }

random_forest_v1 = GridSearchCV(pipeline, param_grid=params, scoring='precision_samples', cv = None, n_jobs = -1)

In [62]:
# train model and measure how long this takes
start = timeit.default_timer()

random_forest_v1.fit(X_train, y_train)

stop = timeit.default_timer()
print('Training time (in seconds): ', stop - start)

Training time (in seconds):  937.4391565000042


In [63]:
start = timeit.default_timer()

Y_pred_rf1 = random_forest_v1.predict(X_test)

stop = timeit.default_timer()
print('Validation time (in seconds): ', stop - start)

Validation time (in seconds):  48.965215199998056


In [64]:
# see scores for the model
print('OVERALL F-SCORES: ')
print('Micro F1 Score: ' + str(f1_score(y_test, Y_pred_rf1, average = 'micro')))
print('Macro F1 Score: ' + str(f1_score(y_test, Y_pred_rf1, average = 'macro')))
print('Weighted F1 Score: ' + str(f1_score(y_test, Y_pred_rf1, average = 'weighted')))
print('Samples F1 Score: ' + str(f1_score(y_test, Y_pred_rf1, average = 'samples')))

for idx, column in enumerate(column_list):
    y_true = y_test.to_numpy()[:, idx]
    y_pred = Y_pred_rf1[:, idx]
    print(column + ':')
    print(classification_report(y_true, y_pred) + '\n')

OVERALL F-SCORES: 
Micro F1 Score: 0.6095441985511938
Macro F1 Score: 0.23659088821858296
Weighted F1 Score: 0.5413497084558296
Samples F1 Score: 0.47606899887683196
related:
              precision    recall  f1-score   support

           0       0.66      0.46      0.54      1928
           1       0.84      0.92      0.88      5937

    accuracy                           0.81      7865
   macro avg       0.75      0.69      0.71      7865
weighted avg       0.80      0.81      0.80      7865


request:
              precision    recall  f1-score   support

           0       0.89      0.97      0.93      6511
           1       0.78      0.44      0.56      1354

    accuracy                           0.88      7865
   macro avg       0.84      0.71      0.75      7865
weighted avg       0.87      0.88      0.87      7865


offer:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7832
           1       0.00      0.00      0.00 

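The tuned model scored slightly higher than the untuned pipeline on all four F1 averages (micro F1 0.610 vs. 0.594), but the gain is small. Will try one more time to find better parameters.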

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [65]:
# make set of parameter values to test in GridSearchCV
params = {  'clf__estimator__bootstrap': [True, False],
            'clf__estimator__n_estimators': [10, 20], 
            # for max features, 'auto' and 'sqrt' are the same
            'clf__estimator__max_features': ['log2','auto'],
            'clf__estimator__criterion': ['entropy', 'gini'], 
            #'clf__estimator__max_depth': [2, 3, 5, 10], 
            #'clf__estimator__min_samples_split': [2, 3, 5],
            #'clf__estimator__min_samples_leaf': [1,5,8] 
         }

random_forest_v2 = GridSearchCV(pipeline, param_grid=params, scoring='precision_samples', cv = None)

In [66]:
# train model and measure how long this takes
start = timeit.default_timer()

random_forest_v2.fit(X_train, y_train)

stop = timeit.default_timer()
print('Training time (in seconds): ', stop - start)

Training time (in seconds):  7171.3076019


In [67]:
start = timeit.default_timer()

Y_pred_rf2 = random_forest_v2.predict(X_test)

stop = timeit.default_timer()
print('Validation time (in seconds): ', stop - start)

Validation time (in seconds):  55.39805679999699


In [68]:
random_forest_v2.best_estimator_

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                  

In [69]:
# see scores for the model
print('OVERALL F-SCORES: ')
print('Micro F1 Score: ' + str(f1_score(y_test, Y_pred_rf2, average = 'micro')))
print('Macro F1 Score: ' + str(f1_score(y_test, Y_pred_rf2, average = 'macro')))
print('Weighted F1 Score: ' + str(f1_score(y_test, Y_pred_rf2, average = 'weighted')))
print('Samples F1 Score: ' + str(f1_score(y_test, Y_pred_rf2, average = 'samples')))

for idx, column in enumerate(column_list):
    y_true = y_test.to_numpy()[:, idx]
    y_pred = Y_pred_rf2[:, idx]
    print(column + ':')
    print(classification_report(y_true, y_pred) + '\n')

OVERALL F-SCORES: 
Micro F1 Score: 0.5353935765714748
Macro F1 Score: 0.13866263895875613
Weighted F1 Score: 0.44468146400480496
Samples F1 Score: 0.45005375506355255
related:
              precision    recall  f1-score   support

           0       0.71      0.38      0.49      1928
           1       0.82      0.95      0.88      5937

    accuracy                           0.81      7865
   macro avg       0.77      0.66      0.69      7865
weighted avg       0.80      0.81      0.79      7865


request:
              precision    recall  f1-score   support

           0       0.88      0.98      0.93      6511
           1       0.84      0.37      0.51      1354

    accuracy                           0.88      7865
   macro avg       0.86      0.68      0.72      7865
weighted avg       0.87      0.88      0.86      7865


offer:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7832
           1       0.00      0.00      0.00

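This model performed worse than both previous models (micro F1 0.535 vs. 0.594 and 0.610). One possible reason is that the grid search selects parameters by `precision_samples` rather than F1.
Stick with the original pipeline for now, and try different classification algorithms later.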

In [70]:
models = pd.DataFrame(columns = ['Model', 'F1 Score (Micro)', 'F1 Score (Weighted)', 'Training Time (Seconds)'])
models = models.append({'Model': 'Random Forest', 
               'F1 Score (Micro)' : f1_score(y_test, Y_pred, average = 'micro'),
               'F1 Score (Weighted)' : f1_score(y_test, Y_pred, average = 'weighted'),
               'Training Time (Seconds)': 1149
              }, ignore_index = True)

In [71]:
# get F1 scores for this model since the data is imbalanced
print('Micro F1 Score: ' + str(f1_score(y_test, Y_pred, average = 'micro')))
print('Macro F1 Score: ' + str(f1_score(y_test, Y_pred, average = 'macro')))
print('Weighted F1 Score: ' + str(f1_score(y_test, Y_pred, average = 'weighted')))
print('Samples F1 Score: ' + str(f1_score(y_test, Y_pred, average = 'samples')))

Micro F1 Score: 0.5944503197864733
Macro F1 Score: 0.23056060745281387
Weighted F1 Score: 0.5286867606863018
Samples F1 Score: 0.4647858210861778



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF



Trying a linear classifier trained with stochastic gradient descent (SGD).

In [72]:
# dropping the child_alone column since it only has 0's
# and will not work in this classification algorithm
df2 = df.drop('child_alone', axis = 1)

X = df2['message']

# target variable is all columns that have category data
Y = df2.iloc[:,4:]

# split again, since a column was dropped from the dataset
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, Y, test_size = .30, random_state = 1)

In [73]:
# make pipeline with SGD classifier

pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(SGDClassifier()))
])

start = timeit.default_timer()

pipeline2.fit(X_train2, y_train2)

stop = timeit.default_timer()
print('Training time (in seconds): ', stop - start)

Y_pred2 = pipeline2.predict(X_test2)


# get F1 scores for this model since the data is imbalanced
print('Micro F1 Score: ' + str(f1_score(y_test2, Y_pred2, average = 'micro')))
print('Macro F1 Score: ' + str(f1_score(y_test2, Y_pred2, average = 'macro')))
print('Weighted F1 Score: ' + str(f1_score(y_test2, Y_pred2, average = 'weighted')))
print('Samples F1 Score: ' + str(f1_score(y_test2, Y_pred2, average = 'samples')))

Training time (in seconds):  106.50824439999997
Micro F1 Score: 0.6697575399172088
Macro F1 Score: 0.32973251688510047
Weighted F1 Score: 0.6073477961701148
Samples F1 Score: 0.519043199937892


In [74]:
models = models.append({'Model': 'SGD', 
               'F1 Score (Micro)' : f1_score(y_test2, Y_pred2, average = 'micro'),
               'F1 Score (Weighted)' : f1_score(y_test2, Y_pred2, average = 'weighted'),
               'Training Time (Seconds)' : 154
              }, ignore_index = True)

Now trying a logistic regression model

In [75]:
pipeline3 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(LogisticRegression()))
])

start = timeit.default_timer()

pipeline3.fit(X_train2, y_train2)

stop = timeit.default_timer()
print('Training time (in seconds): ', stop - start)

Y_pred3 = pipeline3.predict(X_test2)

# get F1 scores for this model since the data is imbalanced
print('Micro F1 Score: ' + str(f1_score(y_test2, Y_pred3, average = 'micro')))
print('Macro F1 Score: ' + str(f1_score(y_test2, Y_pred3, average = 'macro')))
print('Weighted F1 Score: ' + str(f1_score(y_test2, Y_pred3, average = 'weighted')))
print('Samples F1 Score: ' + str(f1_score(y_test2, Y_pred3, average = 'samples')))

Training time (in seconds):  108.8156525000013
Micro F1 Score: 0.6458867340523287
Macro F1 Score: 0.2881896124660017
Weighted F1 Score: 0.5854424461485186
Samples F1 Score: 0.5071035672299045


In [76]:
models = models.append({'Model': 'Logistic Regression', 
               'F1 Score (Micro)' : f1_score(y_test2, Y_pred3, average = 'micro'),
               'F1 Score (Weighted)' : f1_score(y_test2, Y_pred3, average = 'weighted'),
               'Training Time (Seconds)' : (stop-start)
              }, ignore_index = True)

In [77]:
models

   Model                F1 Score (Micro)  F1 Score (Weighted)  Training Time (Seconds)
0  Random Forest        0.59445           0.528687             1149.0
1  SGD                  0.669758          0.607348             154.0
2  Logistic Regression  0.645887          0.585442             108.816


In [78]:
# Time difference between SGD and logistic regression
print('Difference = ' + str(round((models.iloc[1,3] / models.iloc[2,3]) - 1, 3)))

Difference = 0.415
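The printed figure is a relative difference, not a number of seconds. A minimal sketch of the same arithmetic, using the two training times recorded in the comparison table (the variable names are hypothetical):

```python
# values recorded in the model comparison table above
sgd_seconds = 154.0
logreg_seconds = 108.816

# ratio of the two times minus 1 gives the relative difference
relative_difference = round(sgd_seconds / logreg_seconds - 1, 3)
absolute_difference = round(sgd_seconds - logreg_seconds, 3)
print(relative_difference, absolute_difference)
# → 0.415 45.184
```

So the recorded SGD time is about 41.5% longer than the logistic regression time, which is roughly 45 seconds in absolute terms.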


The SGD model had the best micro F1 score (0.670, vs. 0.646 for logistic regression and 0.594 for random forest). Although the logistic regression model's recorded training time was lower, the gap of about 45 seconds is small in absolute terms. The SGD model will be used for the message classification application.

However, this assignment requires implementing GridSearchCV, so the exported model must be one tuned with GridSearchCV. I will therefore run GridSearchCV on the SGD pipeline.

In [79]:
# only get params for classifier
[i for i in list(pipeline2.get_params().keys()) if i.startswith('clf_')]

['clf__estimator__alpha',
 'clf__estimator__average',
 'clf__estimator__class_weight',
 'clf__estimator__early_stopping',
 'clf__estimator__epsilon',
 'clf__estimator__eta0',
 'clf__estimator__fit_intercept',
 'clf__estimator__l1_ratio',
 'clf__estimator__learning_rate',
 'clf__estimator__loss',
 'clf__estimator__max_iter',
 'clf__estimator__n_iter_no_change',
 'clf__estimator__n_jobs',
 'clf__estimator__penalty',
 'clf__estimator__power_t',
 'clf__estimator__random_state',
 'clf__estimator__shuffle',
 'clf__estimator__tol',
 'clf__estimator__validation_fraction',
 'clf__estimator__verbose',
 'clf__estimator__warm_start',
 'clf__estimator',
 'clf__n_jobs']

In [39]:
# make set of parameter values to test in GridSearchCV
# only chose a few parameters
params = {   #'clf__estimator__alpha',
             #'clf__estimator__average': [True, False],
             #'clf__estimator__class_weight',
             #'clf__estimator__early_stopping' : [True, False],
             #'clf__estimator__epsilon',
             'clf__estimator__eta0': [0.01],
             #'clf__estimator__fit_intercept': [True, False],
             #'clf__estimator__l1_ratio',
             'clf__estimator__learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
             'clf__estimator__loss': ['hinge', 'log', 'modified_huber', 'squared_hinge', 'perceptron', \
                              'squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],
             #'clf__estimator__max_iter',
             #'clf__estimator__n_iter_no_change': [5, 10],
             #'clf__estimator__n_jobs',
             'clf__estimator__penalty': ['l2', 'l1', 'elasticnet'],
             #'clf__estimator__power_t',
             'clf__estimator__random_state': [50]
             #'clf__estimator__shuffle': [True, False],
             #'clf__estimator__tol',
             #'clf__estimator__validation_fraction',
             #'clf__estimator__verbose',
             #'clf__estimator__warm_start': [True, False]
         }

# size of grid
pg = ParameterGrid(params)
print(len(pg))

sgd_grid = GridSearchCV(pipeline2, param_grid=params, scoring='precision_samples', cv = None, n_jobs = -1)

108
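The grid size of 108 is just the product of the list lengths (1 × 4 × 9 × 3 × 1). The sketch below reproduces it with `itertools.product`, dropping the `clf__estimator__` prefix from the keys for brevity. The fold count is an assumption: `cv=None` means 5-fold cross-validation in scikit-learn 0.22 and later, while older releases used 3.

```python
from itertools import product

# same value lists as the params dictionary above, with shortened keys
params = {
    "eta0": [0.01],
    "learning_rate": ["constant", "optimal", "invscaling", "adaptive"],
    "loss": ["hinge", "log", "modified_huber", "squared_hinge", "perceptron",
             "squared_loss", "huber", "epsilon_insensitive",
             "squared_epsilon_insensitive"],
    "penalty": ["l2", "l1", "elasticnet"],
    "random_state": [50],
}

n_combinations = len(list(product(*params.values())))

n_folds = 5  # assumption: depends on the installed scikit-learn version
total_fits = n_combinations * n_folds  # excludes the final refit on all training data
print(n_combinations, total_fits)
# → 108 540
```

Under that assumption the search fits the whole pipeline several hundred times, which helps explain the multi-hour training time reported below.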


In [40]:
# train model and measure how long this takes
start = timeit.default_timer()

sgd_grid.fit(X_train2, y_train2)

stop = timeit.default_timer()
print('Training time (in seconds): ', stop - start)

# 20,240 seconds (5.6 hours) for a grid size of 108 

Training time (in seconds):  20240.4545716


In [41]:
sgd_grid.best_estimator_

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=SGDClassifier(alpha=0.0001,
                                                               average=False,
                                                               class_weight=None,
                                                               early_stopping=False,
             

In [42]:
start = timeit.default_timer()

Y_pred_sgd_grid = sgd_grid.predict(X_test2)

stop = timeit.default_timer()
print('Validation time (in seconds): ', stop - start)

Validation time (in seconds):  49.60978360000445


In [44]:
# see scores for the model
print('OVERALL F-SCORES: ')
print('Micro F1 Score: ' + str(f1_score(y_test2, Y_pred_sgd_grid, average = 'micro')))
print('Macro F1 Score: ' + str(f1_score(y_test2, Y_pred_sgd_grid, average = 'macro')))
print('Weighted F1 Score: ' + str(f1_score(y_test2, Y_pred_sgd_grid, average = 'weighted')))
print('Samples F1 Score: ' + str(f1_score(y_test2, Y_pred_sgd_grid, average = 'samples')))

column_list = list(Y.columns)
for idx, column in enumerate(column_list):
    y_true = y_test2.to_numpy()[:, idx]
    y_pred = Y_pred_sgd_grid[:, idx]
    print(column + ':')
    print(classification_report(y_true, y_pred) + '\n')

OVERALL F-SCORES: 
Micro F1 Score: 0.3927000356675782
Macro F1 Score: 0.03439432261131815
Weighted F1 Score: 0.24826468424956508
Samples F1 Score: 0.40817995785946215
related:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1928
           1       0.75      1.00      0.86      5937

    accuracy                           0.75      7865
   macro avg       0.38      0.50      0.43      7865
weighted avg       0.57      0.75      0.65      7865


request:
              precision    recall  f1-score   support

           0       0.83      1.00      0.91      6511
           1       0.00      0.00      0.00      1354

    accuracy                           0.83      7865
   macro avg       0.41      0.50      0.45      7865
weighted avg       0.69      0.83      0.75      7865


offer:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      7832
           1       0.00      0.00      0.00


medical_help:
              precision    recall  f1-score   support

           0       0.92      1.00      0.96      7199
           1       0.00      0.00      0.00       666

    accuracy                           0.92      7865
   macro avg       0.46      0.50      0.48      7865
weighted avg       0.84      0.92      0.87      7865


medical_products:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      7447
           1       0.00      0.00      0.00       418

    accuracy                           0.95      7865
   macro avg       0.47      0.50      0.49      7865
weighted avg       0.90      0.95      0.92      7865


search_and_rescue:
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      7657
           1       0.00      0.00      0.00       208

    accuracy                           0.97      7865
   macro avg       0.49      0.50      0.49      7865
weighted avg       0

In [None]:
# add to model comparison dataframe
# stop - start now holds the validation time from the previous cell, so the
# grid-search training time printed above (about 20,240 seconds) is recorded instead
models = models.append({'Model': 'SGD w/GridSearchCV', 
               'F1 Score (Micro)' : f1_score(y_test2, Y_pred_sgd_grid, average = 'micro'),
               'F1 Score (Weighted)' : f1_score(y_test2, Y_pred_sgd_grid, average = 'weighted'),
               'Training Time (Seconds)' : 20240
              }, ignore_index = True)

### 9. Export your model as a pickle file

The SGD model using GridSearchCV did not perform as well as the Random Forest models using GridSearchCV (micro F1 0.393 vs. 0.610 and 0.535).

Because the project requires a model that uses GridSearchCV, the Random Forest model tuned with GridSearchCV will be exported.

In [81]:
# use a context manager so the file is closed after writing
with open('classifier.pkl', 'wb') as f:
    dump(random_forest_v2, f)
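A minimal round-trip sketch of exporting and reloading a pickle file. The dictionary is a hypothetical stand-in for the fitted `GridSearchCV` object, and the file is written to a temporary directory so that nothing in the project folder is overwritten.

```python
import os
import pickle
import tempfile

# Assumption: a plain dict standing in for the fitted model
model = {"name": "toy-model", "params": {"n_estimators": 10}}

path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

# write the object to disk
with open(path, "wb") as f:
    pickle.dump(model, f)

# read it back, as a separate script would
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
# → True
```

When loading the real `classifier.pkl`, the `tokenize` function has to be importable in the loading process, because pickle stores functions by reference rather than by value.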

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.