# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [6]:
# Import libraries
import pandas as pd
import numpy as np
import os
import pickle
from sqlalchemy import create_engine
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer
import xgboost as xgb
from xgboost import XGBClassifier

nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

import warnings

warnings.simplefilter('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\okekec\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\okekec\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\okekec\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [7]:
# load data from database
engine = create_engine('sqlite:///Project3.db')
df = pd.read_sql("SELECT * FROM df", engine)
X = df['message']
y = df.iloc[:, 4:]



### 2. Write a tokenization function to process your text data

In [8]:
def tokenize(text):
    # Replace each URL in the text with a placeholder token
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
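
To illustrate the URL-replacement step on its own, the sketch below applies the same regular expression to a sample string. The helper name `replace_urls` and the sample message are illustrative assumptions, not part of the notebook; tokenization and lemmatization are left out so that the snippet runs without NLTK data.

```python
import re

# Same pattern as in tokenize(), written as a raw string
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by URL_REGEX with a placeholder token."""
    for url in re.findall(URL_REGEX, text):
        text = text.replace(url, placeholder)
    return text

# Hypothetical message used only for this example
sample = "Please see http://example.com/help for details"
cleaned = replace_urls(sample)
print(cleaned)  # Please see urlplaceholder for details
```

Because the pattern accepts most punctuation, a period or comma that directly follows a URL is treated as part of the match.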

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [10]:
#using MultiOutputClassifier to build an ML pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
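
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits the same three-step pipeline on a handful of made-up messages with two made-up label columns. The toy data, the default tokenizer, and the fixed `random_state` are assumptions for this example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical messages and labels, not taken from the project database
toy_messages = [
    "we need water and food",
    "the bridge collapsed after the flood",
    "please send drinking water",
    "roads are blocked by the flood",
]
# Two hypothetical label columns: [water, floods]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

# One row per message, one column per label
toy_pred = toy_pipeline.predict(["water is needed", "flood in the city"])
print(toy_pred.shape)

# MultiOutputClassifier fits one independent classifier per label column
print(len(toy_pipeline.named_steps['clf'].estimators_))
```

Each label column gets its own copy of the base estimator, which is why the parameter names in a grid search take the form `clf__estimator__<param>`.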

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [11]:
# Split the data (random_state=1), then seed NumPy with 42 before training
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 1) 
np.random.seed(42)
pipeline.fit(X_train, y_train)


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [12]:
def get_eval_metrics(actual, predicted, col_names):
    """Calculate evaluation metrics for ML model
    
    Args:
    actual: array. Array containing actual labels.
    predicted: array. Array containing predicted labels.
    col_names: list of strings. List containing names for each of the predicted fields.
       
    Returns:
    metrics_df: dataframe. Dataframe containing the accuracy, precision, recall 
    and f1 score for a given set of actual and predicted labels.
    """
    metrics = []
    
    # Calculate evaluation metrics for each set of labels
    for i in range(len(col_names)):
        accuracy = accuracy_score(actual[:, i], predicted[:, i])
        precision = precision_score(actual[:, i], predicted[:, i])
        recall = recall_score(actual[:, i], predicted[:, i])
        f1 = f1_score(actual[:, i], predicted[:, i])
        
        metrics.append([accuracy, precision, recall, f1])
    
    # Create dataframe containing metrics
    metrics = np.array(metrics)
    metrics_df = pd.DataFrame(data = metrics, index = col_names, columns = ['Accuracy', 'Precision', 'Recall', 'F1'])
      
    return metrics_df
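
The task description suggests calling `classification_report` on each column. A minimal sketch of that alternative on toy arrays is shown below; the arrays and column names are made up, and the `zero_division` argument assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical labels for two output categories
toy_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
toy_pred = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])
toy_cols = ["request", "water"]

reports = {}
for i, name in enumerate(toy_cols):
    # output_dict=True returns nested dicts instead of a formatted string
    reports[name] = classification_report(
        toy_true[:, i], toy_pred[:, i], output_dict=True, zero_division=0)

# Recall for the positive class of the first category
print(reports["request"]["1"]["recall"])
```

The report also covers the negative class and the averages, whereas `get_eval_metrics` keeps only the positive-class scores in one compact table.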

In [14]:
# Calculate evaluation metrics for training set
y_train_pred = pipeline.predict(X_train)
col_names = list(y.columns.values)

print(get_eval_metrics(np.array(y_train), y_train_pred, col_names))

                        Accuracy  Precision    Recall        F1
request                 0.987285   0.998707  0.926237  0.961108
offer                   0.998678   1.000000  0.717391  0.835443
aid_related             0.986370   0.997088  0.969831  0.983271
medical_help            0.986878   0.999216  0.832136  0.908054
medical_products        0.993337   0.998802  0.865145  0.927182
search_and_rescue       0.993897   0.997722  0.786355  0.879518
security                0.995067   1.000000  0.729805  0.843800
military                0.995830   0.998305  0.879104  0.934921
child_alone             1.000000   0.000000  0.000000  0.000000
water                   0.994253   0.999116  0.909823  0.952381
food                    0.993287   1.000000  0.939142  0.968616
shelter                 0.991150   0.999361  0.900460  0.947337
clothing                0.996999   1.000000  0.807190  0.893309
money                   0.995219   1.000000  0.796976  0.887019
missing_people          0.998016   1.000

In [15]:
# Calculate evaluation metrics for test set
y_test_pred = pipeline.predict(X_test)

eval_metrics1 = get_eval_metrics(np.array(y_test), y_test_pred, col_names)
print(eval_metrics1)

                        Accuracy  Precision    Recall        F1
request                 0.879921   0.843750  0.379280  0.523319
offer                   0.996033   0.000000  0.000000  0.000000
aid_related             0.728410   0.764771  0.505659  0.608791
medical_help            0.918370   0.680000  0.061483  0.112769
medical_products        0.949954   0.783784  0.083095  0.150259
search_and_rescue       0.974519   0.500000  0.041916  0.077348
security                0.983064   1.000000  0.008929  0.017699
military                0.970705   0.458333  0.057895  0.102804
child_alone             1.000000   0.000000  0.000000  0.000000
water                   0.945072   0.824074  0.206977  0.330855
food                    0.915319   0.861818  0.314324  0.460641
shelter                 0.927525   0.834437  0.218750  0.346630
clothing                0.984590   0.400000  0.040404  0.073394
money                   0.978181   0.333333  0.014184  0.027211
missing_people          0.988251   1.000

The model performs well on the training set in terms of both accuracy and F1 score. On the out-of-sample test set, however, accuracy stays high while the F1 scores are very low for most categories. This is likely due to the imbalanced nature of the dataset: when positive labels are rare, a model can reach high accuracy by mostly predicting the negative class.
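
To make the accuracy-versus-F1 gap concrete, the sketch below scores a classifier that always predicts the negative class on a label with 2% positives. The data and the helper `accuracy_and_f1` are assumptions for illustration; the metrics are computed by hand with NumPy.

```python
import numpy as np

def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1 of the positive class) for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = np.mean(y_true == y_pred)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return float(accuracy), float(f1)

# Hypothetical label: 2 positives out of 100 samples
toy_true = np.array([1] * 2 + [0] * 98)
# A classifier that never predicts the positive class
toy_pred = np.zeros(100, dtype=int)

acc, f1 = accuracy_and_f1(toy_true, toy_pred)
print(acc, f1)  # 0.98 0.0
```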

In [10]:
# Proportion of rows in each column with label == 1
y.sum()/len(y)

request                   0.170659
offer                     0.004501
aid_related               0.414251
medical_help              0.079493
medical_products          0.050084
search_and_rescue         0.027617
security                  0.017966
military                  0.032804
child_alone               0.000000
water                     0.063778
food                      0.111497
shelter                   0.088267
clothing                  0.015449
money                     0.023039
missing_people            0.011367
refugees                  0.033377
death                     0.045545
other_aid                 0.131446
infrastructure_related    0.065037
transport                 0.045812
buildings                 0.050847
electricity               0.020293
tools                     0.006065
hospitals                 0.010795
shops                     0.004577
aid_centers               0.011787
other_infrastructure      0.043904
weather_related           0.278341
floods              

After calculating the proportion of positive labels in each column, we can see that the data is imbalanced: many categories are positive in fewer than 5% of the messages (and `child_alone` has no positive examples at all), which makes them difficult to predict. In the following sections we tune the model and then try a method for dealing with the imbalanced dataset.
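
A small sketch of how rare and constant label columns could be flagged programmatically is given below. The toy frame and the names `rare` and `constant` are assumptions for the example; the 5% threshold follows the text above.

```python
import pandas as pd

# Hypothetical label frame with 40 rows
toy_y = pd.DataFrame({
    "aid_related": [1] * 16 + [0] * 24,  # 40% positive
    "offer":       [1] * 1 + [0] * 39,   # 2.5% positive
    "child_alone": [0] * 40,             # no positives at all
})

# For 0/1 labels the column mean equals the share of positive rows
positive_rate = toy_y.mean()

# Columns where fewer than 5% of the rows are positive
rare = positive_rate[positive_rate < 0.05].index.tolist()

# Columns with a single distinct value cannot be learned or scored meaningfully
constant = [col for col in toy_y.columns if toy_y[col].nunique() == 1]

print(rare)      # ['offer', 'child_alone']
print(constant)  # ['child_alone']
```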

### 6. Improve your model
Use grid search to find better parameters. 

In [16]:
# Define the performance metric used in the grid search
def performance_metric(y_true, y_pred):
    """Calculate median F1 score for all of the output classifiers
    
    Args:
    y_true: array. Array containing actual labels.
    y_pred: array. Array containing predicted labels.
        
    Returns:
    score: float. Median F1 score for all of the output classifiers
    """
    f1_list = []
    for i in range(np.shape(y_pred)[1]):
        f1 = f1_score(np.array(y_true)[:, i], y_pred[:, i])
        f1_list.append(f1)
        
    score = np.median(f1_list)
    return score

We use the median F1 score across the output classifiers rather than the mean. The per-category F1 scores are unevenly distributed, and the median is less sensitive to extreme values and zeros.
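
The difference between the two summaries can be seen on a small made-up list of per-category F1 scores (the values below are illustrative, not results from this notebook):

```python
import numpy as np

# Hypothetical per-category F1 scores: mostly low, one very high
toy_f1 = np.array([0.0, 0.0, 0.02, 0.10, 0.95])

mean_f1 = toy_f1.mean()        # pulled upward by the single high score
median_f1 = np.median(toy_f1)  # reflects the typical category

print(mean_f1, median_f1)
```

One consequence of this choice is that the median is exactly 0 whenever more than half of the categories have an F1 score of 0, so a grid search cannot rank candidates that all fall into that situation.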

In [40]:
#Modify your model to return a GridSearchCV object.
parameters = {'vect__max_df': [0.75, 1.0],
              'tfidf__use_idf':[True, False],
              'clf__estimator__n_estimators':[10, 25], 
              'clf__estimator__min_samples_split':[2, 5, 10]}

scorer = make_scorer(performance_metric)
cv = GridSearchCV(pipeline, param_grid = parameters, scoring = scorer, verbose = 2)

# Find best parameters
np.random.seed(42)
tuned_model = cv.fit(X_train, y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=0.75 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=0.75, total=  44.4s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=0.75 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   51.9s remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=0.75, total=  46.0s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=0.75 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=0.75, total=  47.6s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1.0 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1.0, total=  43.9s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1.0 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1.0, total=  42.6s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1.0 
[CV]  clf__estimator__min

[CV]  clf__estimator__min_samples_split=5, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__max_df=1.0, total=  32.2s
[CV] clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__max_df=0.75 
[CV]  clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__max_df=0.75, total= 1.2min
[CV] clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__max_df=0.75 
[CV]  clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__max_df=0.75, total= 1.3min
[CV] clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__max_df=0.75 
[CV]  clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__max_df=0.75, total= 1.2min
[CV] clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__max_df=1.0 
[CV]  clf__estimator_

[CV]  clf__estimator__min_samples_split=10, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__max_df=1.0, total=  59.4s
[CV] clf__estimator__min_samples_split=10, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__max_df=1.0 
[CV]  clf__estimator__min_samples_split=10, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__max_df=1.0, total=  59.6s
[CV] clf__estimator__min_samples_split=10, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__max_df=1.0 
[CV]  clf__estimator__min_samples_split=10, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__max_df=1.0, total=  59.1s


[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed: 80.1min finished


In [13]:
# Get results of grid search
tuned_model.cv_results_

{'mean_fit_time': array([ 37.4043591 ,  37.37849339,  37.58101217,  37.34998488,
         84.4230295 ,  83.04098415,  83.26359765,  82.90529323,
         31.84374746,  31.7137289 ,  30.90857363,  30.73453569,
         68.86535915,  69.76065588,  67.97183466,  68.35721795,
         29.58434391,  29.4461933 ,  28.67922926,  28.55971217,
         63.64330308,  62.81008021,  60.54885459,  61.31373405]),
 'std_fit_time': array([ 0.12170782,  0.13465771,  0.09260326,  0.15678535,  0.12696913,
         0.11132765,  0.05645151,  0.0722737 ,  0.30331126,  0.26530105,
         0.11190421,  0.16644813,  0.15746703,  0.26548985,  0.14486467,
         0.13941127,  0.10647157,  0.07907105,  0.17734197,  0.08149233,
         0.32487823,  0.3411261 ,  0.21035485,  0.2412044 ]),
 'mean_score_time': array([ 4.57505337,  4.54153554,  4.55048974,  4.51354798,  6.64433058,
         6.57727273,  6.51987433,  6.48522051,  4.57555636,  4.49807819,
         4.47716482,  4.45687604,  6.55812836,  6.56139787,  6

In [14]:
# Best mean test score
np.max(tuned_model.cv_results_['mean_test_score'])

0.1424982491782093

In [15]:
# Parameters for best mean test score
tuned_model.best_params_

{'clf__estimator__min_samples_split': 5,
 'clf__estimator__n_estimators': 10,
 'tfidf__use_idf': True,
 'vect__max_df': 1.0}
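
The raw `cv_results_` dictionary is easier to read as a table. The sketch below runs a tiny grid search on made-up data and sorts the candidates by rank; the toy texts, the `LogisticRegression` classifier, and the parameter grid are assumptions chosen only to keep the example fast.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical messages with a single binary label
toy_X = ["need water", "need food", "water please", "food please",
         "flood damage", "storm damage", "flood warning", "storm warning"]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

toy_search = GridSearchCV(
    Pipeline([('vect', CountVectorizer()), ('clf', LogisticRegression())]),
    param_grid={'vect__binary': [True, False], 'clf__C': [0.1, 1.0]},
    scoring='f1',
    cv=2,
)
toy_search.fit(toy_X, toy_y)

# One row per candidate, best-ranked candidate first
results = (pd.DataFrame(toy_search.cv_results_)
           [['params', 'mean_test_score', 'rank_test_score']]
           .sort_values('rank_test_score'))
print(results)
```

The same three lines at the end should work on `tuned_model.cv_results_`, assuming the grid search above has been fitted.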

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [17]:
# Calculate evaluation metrics for test set
tuned_pred_test = tuned_model.predict(X_test)

eval_metrics2 = get_eval_metrics(np.array(y_test), tuned_pred_test, col_names)

print(eval_metrics2)

                        Accuracy  Precision    Recall        F1
request                 0.883735   0.810544  0.431958  0.563574
offer                   0.996033   0.000000  0.000000  0.000000
aid_related             0.737412   0.729486  0.590727  0.652814
medical_help            0.918370   0.593750  0.103074  0.175655
medical_products        0.950107   0.843750  0.077364  0.141732
search_and_rescue       0.975435   0.687500  0.065868  0.120219
security                0.981996   0.125000  0.008929  0.016667
military                0.970857   0.480000  0.063158  0.111628
child_alone             1.000000   0.000000  0.000000  0.000000
water                   0.946445   0.898990  0.206977  0.336484
food                    0.918828   0.824561  0.374005  0.514599
shelter                 0.930424   0.812500  0.270833  0.406250
clothing                0.987031   0.750000  0.212121  0.330709
money                   0.978639   0.555556  0.035461  0.066667
missing_people          0.987946   1.000

In [18]:
# Get summary stats for first model
eval_metrics1.describe()

        Accuracy  Precision     Recall         F1
count  37.000000  37.000000  37.000000  37.000000
mean    0.945789   0.566834   0.172575   0.226513
std     0.055346   0.341899   0.248865   0.272454
min     0.728410   0.000000   0.000000   0.000000
25%     0.931340   0.333333   0.008929   0.017699
50%     0.956210   0.680000   0.059701   0.108696
75%     0.983064   0.834437   0.218750   0.346630
max     1.000000   1.000000   0.928390   0.951134


In [19]:
# Get summary stats for tuned model
eval_metrics2.describe()

        Accuracy  Precision     Recall         F1
count  37.000000  37.000000  37.000000  37.000000
mean    0.947649   0.618904   0.204921   0.264554
std     0.052897   0.327372   0.258303   0.278661
min     0.737412   0.000000   0.000000   0.000000
25%     0.935459   0.515152   0.013333   0.026316
50%     0.958651   0.733333   0.094017   0.166667
75%     0.983216   0.824561   0.368984   0.498195
max     1.000000   1.000000   0.945761   0.957249


After grid search, the mean F1 score rises from 0.227 to 0.265 and the median from 0.109 to 0.167. However, the F1 scores are still very low. We can try to improve the model further using another machine learning algorithm, XGBoost.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [22]:
# Try using XGBoost instead of Random Forest Classifier
pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(XGBClassifier()))
])

parameters2 = {'vect__max_df': [1],
              'tfidf__use_idf':[True],
              'clf__estimator__n_estimators':[10,50, 100], 
              'clf__estimator__learning_rate': [0.05, 0.1],
              'clf__estimator__max_depth': [3, 4]}
scorer = make_scorer(performance_metric)
cv2 = GridSearchCV(pipeline2, param_grid = parameters2, scoring = scorer, verbose = 2)

# Find best parameters
np.random.seed(42)
tuned_model2 = cv2.fit(X_train, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1, total=  12.8s
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   19.4s remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1, total=  13.5s
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1, total=  12.2s
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=50, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=50, tfidf__use_idf=True, vect__max_df=1, total=  27.2s
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=50, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=50, tfidf__use_

[CV]  clf__estimator__learning_rate=0.1, clf__estimator__max_depth=4, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1, total=  11.8s
[CV] clf__estimator__learning_rate=0.1, clf__estimator__max_depth=4, clf__estimator__n_estimators=50, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__max_depth=4, clf__estimator__n_estimators=50, tfidf__use_idf=True, vect__max_df=1, total=  27.0s
[CV] clf__estimator__learning_rate=0.1, clf__estimator__max_depth=4, clf__estimator__n_estimators=50, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__max_depth=4, clf__estimator__n_estimators=50, tfidf__use_idf=True, vect__max_df=1, total=  28.5s
[CV] clf__estimator__learning_rate=0.1, clf__estimator__max_depth=4, clf__estimator__n_estimators=50, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__max_depth=4, clf__estimator__n_estimators=50, tfidf__use_idf=Tru

[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed: 20.9min finished


In [23]:
# Get results of grid search
tuned_model2.cv_results_

{'mean_fit_time': array([  9.96029933,  24.98443842,  43.67744033,   9.46175329,
         25.04541747,  43.27202376,   9.3447272 ,  24.83415786,
         43.35273353,   9.28449225,  25.23044475,  47.60735393]),
 'mean_score_time': array([ 2.94934138,  2.83871754,  2.83432174,  2.7868286 ,  2.80853271,
         2.77955286,  2.83252009,  2.82921894,  2.81801248,  2.75331807,
         2.82940356,  3.05417212]),
 'mean_test_score': array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]),
 'mean_train_score': array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]),
 'param_clf__estimator__learning_rate': masked_array(data = [0.05 0.05 0.05 0.05 0.05 0.05 0.1 0.1 0.1 0.1 0.1 0.1],
              mask = [False False False False False False False False False False False False],
        fill_value = ?),
 'param_clf__estimator__max_depth': masked_array(data = [3 3 3 4 4 4 3 3 3 4 4 4],
              mask = [False False False False False False False False False False False 

In [25]:
# Calculate evaluation metrics for test set
tuned_pred_test2 = tuned_model2.predict(X_test)

# Use a separate name so the Random Forest results in eval_metrics2 are not overwritten
eval_metrics_xgb = get_eval_metrics(np.array(y_test), tuned_pred_test2, col_names)

print(eval_metrics_xgb)

                        Accuracy  Precision  Recall        F1
request                 0.826213   0.000000     0.0  0.000000
offer                   0.996033   0.000000     0.0  0.000000
aid_related             0.582087   0.000000     0.0  0.000000
medical_help            0.915624   0.000000     0.0  0.000000
medical_products        0.946750   0.000000     0.0  0.000000
search_and_rescue       0.974519   0.000000     0.0  0.000000
security                0.982911   0.000000     0.0  0.000000
military                0.971010   0.000000     0.0  0.000000
child_alone             1.000000   0.000000     0.0  0.000000
water                   0.934391   0.000000     0.0  0.000000
food                    0.884956   0.000000     0.0  0.000000
shelter                 0.912115   0.000000     0.0  0.000000
clothing                0.984895   0.000000     0.0  0.000000
money                   0.978486   0.000000     0.0  0.000000
missing_people          0.987794   0.000000     0.0  0.000000
refugees

In [29]:
# Parameters for best mean test score
tuned_model2.best_params_

{'clf__estimator__learning_rate': 0.05,
 'clf__estimator__max_depth': 3,
 'clf__estimator__n_estimators': 10,
 'tfidf__use_idf': True,
 'vect__max_df': 1}

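The XGBoost model performs poorly, with only the news column having an F1 score of 0.65. The poor performance could be attributed to the imbalanced targets. Note, however, that the grid sets `vect__max_df` to the integer `1`, which `CountVectorizer` interprets as an absolute document count rather than a proportion, so every term that appears in more than one message is dropped from the vocabulary; this may also contribute to the zero scores. We are going to use the imblearn package, which provides simple approaches for balancing the number of samples per class without introducing new features.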

In [27]:
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
from imblearn.over_sampling import SMOTE  
from imblearn.pipeline import Pipeline as imbPipeline

The imblearn samplers cannot be used as steps in an sklearn `Pipeline`, because they change the number of samples; therefore we have to use the imblearn `Pipeline`.

In [34]:
# Build an imblearn pipeline with the XGBoost settings; `sm` is defined here but is not used as a pipeline step
sm = SMOTE(random_state=42)
pipeline3 = imbPipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(XGBClassifier()))
])

parameters3 = {'vect__max_df': [1],
              'tfidf__use_idf':[True],
              'clf__estimator__n_estimators':[10,100], 
              'clf__estimator__learning_rate': [0.05],
              'clf__estimator__max_depth': [3]}
scorer = make_scorer(performance_metric)
cv3 = GridSearchCV(pipeline3, param_grid = parameters3, scoring = scorer, verbose = 2)
# fitting the model
tuned_model3 = cv3.fit(X_train, y_train)


Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1, total=  12.0s
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   17.2s remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1, total=  11.7s
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__max_df=1, total=  12.8s
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__max_df=1, total=  47.5s
[CV] clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=100, tfidf__use_idf=True, vect__max_df=1 
[CV]  clf__estimator__learning_rate=0.05, clf__estimator__max_depth=3, clf__estimator__n_estimators=100, tfidf__

[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  3.5min finished


In [35]:
# Get results of grid search
tuned_model3.cv_results_

{'mean_fit_time': array([  9.47085317,  43.48959867]),
 'mean_score_time': array([ 2.76971308,  3.18081522]),
 'mean_test_score': array([ 0.,  0.]),
 'mean_train_score': array([ 0.,  0.]),
 'param_clf__estimator__learning_rate': masked_array(data = [0.05 0.05],
              mask = [False False],
        fill_value = ?),
 'param_clf__estimator__max_depth': masked_array(data = [3 3],
              mask = [False False],
        fill_value = ?),
 'param_clf__estimator__n_estimators': masked_array(data = [10 100],
              mask = [False False],
        fill_value = ?),
 'param_tfidf__use_idf': masked_array(data = [True True],
              mask = [False False],
        fill_value = ?),
 'param_vect__max_df': masked_array(data = [1 1],
              mask = [False False],
        fill_value = ?),
 'params': [{'clf__estimator__learning_rate': 0.05,
   'clf__estimator__max_depth': 3,
   'clf__estimator__n_estimators': 10,
   'tfidf__use_idf': True,
   'vect__max_df': 1},
  {'clf__estimato

In [36]:
# Calculate evaluation metrics for test set
tuned_pred_test3 = tuned_model3.predict(X_test)

eval_metrics3 = get_eval_metrics(np.array(y_test), tuned_pred_test3, col_names)

print(eval_metrics3)

                        Accuracy  Precision  Recall        F1
request                 0.826213   0.000000     0.0  0.000000
offer                   0.996033   0.000000     0.0  0.000000
aid_related             0.582087   0.000000     0.0  0.000000
medical_help            0.915624   0.000000     0.0  0.000000
medical_products        0.946750   0.000000     0.0  0.000000
search_and_rescue       0.974519   0.000000     0.0  0.000000
security                0.982911   0.000000     0.0  0.000000
military                0.971010   0.000000     0.0  0.000000
child_alone             1.000000   0.000000     0.0  0.000000
water                   0.934391   0.000000     0.0  0.000000
food                    0.884956   0.000000     0.0  0.000000
shelter                 0.912115   0.000000     0.0  0.000000
clothing                0.984895   0.000000     0.0  0.000000
money                   0.978486   0.000000     0.0  0.000000
missing_people          0.987794   0.000000     0.0  0.000000
refugees

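Using the imblearn pipeline made no difference: the test metrics shown are identical to those of the previous XGBoost model. Note that the `sm` SMOTE sampler is defined but never added as a step of `pipeline3`, so no resampling was actually applied. Hence we settle for the tuned original model, which uses the Random Forest Classifier. Below are the best parameters after tuning the model: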

- CountVectorizer maximum df = 1.0
- TfidfTransformer use_idf = True
- Random Forest Classifier number of estimators = 10
- Random Forest Classifier minimum samples split = 5
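
Since resampling did not change the results here, one alternative that stays within scikit-learn is class weighting. This is a different technique from SMOTE and was not tried in this notebook; the sketch below only shows how it could be wired into the same pipeline shape, using the tuned Random Forest settings listed above and made-up data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# class_weight='balanced' reweights samples inversely to class frequency
weighted_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, min_samples_split=5,
                               class_weight='balanced', random_state=42))),
])

# Hypothetical messages with two imbalanced label columns
toy_messages = [
    "we need water", "roads are blocked", "power is out", "bridge collapsed",
    "school is closed", "phones are down", "shops are closed", "send water now",
]
toy_labels = np.array([[1, 0], [0, 1], [0, 0], [0, 0],
                       [0, 0], [0, 0], [0, 0], [1, 0]])

weighted_pipeline.fit(toy_messages, toy_labels)
weighted_pred = weighted_pipeline.predict(toy_messages)
print(weighted_pred.shape)
```

Whether this improves the F1 scores on the real dataset would need to be checked with the same evaluation as above.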


### 9. Export your model as a pickle file

In [41]:
# Pickle the tuned model
with open('disaster_model.sav', 'wb') as model_file:
    pickle.dump(tuned_model, model_file)
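
To check that a saved model can be restored, a round trip like the following can be used. The toy pipeline, the temporary file path, and the variable names are assumptions for illustration; they stand in for `tuned_model` and `disaster_model.sav`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small stand-in model fitted on made-up messages
toy_model = Pipeline([('vect', CountVectorizer()), ('clf', LogisticRegression())])
toy_model.fit(["need water", "flood damage", "water please", "storm damage"],
              [1, 0, 1, 0])

# Write the model to a temporary file, then read it back
model_path = os.path.join(tempfile.mkdtemp(), 'toy_model.sav')
with open(model_path, 'wb') as f:
    pickle.dump(toy_model, f)
with open(model_path, 'rb') as f:
    restored = pickle.load(f)

# The restored model should give the same predictions as the original
same = list(restored.predict(["need water"])) == list(toy_model.predict(["need water"]))
print(same)
```

Pickle stores functions by reference, so loading the real model requires the custom `tokenize` function to be importable under the same module path in the script that loads it.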

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.