# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re
import pickle
import nltk

# WordNetLemmatizer (used in tokenize) also needs the 'wordnet' corpus;
# the interactive nltk.download() call in the next cell can fetch it.
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             fbeta_score, make_scorer, classification_report)
from sklearn.svm import SVC

import warnings

warnings.simplefilter('ignore')

[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed
[nltk_data]     (_ssl.c:852)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed
[nltk_data]     (_ssl.c:852)>


In [2]:
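import nltk
import ssl

# Workaround for the CERTIFICATE_VERIFY_FAILED error above: fall back to an
# unverified HTTPS context. This disables certificate checks, so use it only
# for this download.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Called without arguments, this opens the interactive NLTK downloader
nltk.download()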

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [None]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')

df = pd.read_sql("SELECT * FROM Messages", engine)

In [3]:
X = df['message']

In [4]:
Y = df.iloc[:,4:]
Y.shape

(26216, 35)
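
The instructions for this step mention `read_sql_table`, while the cell above uses `read_sql` with a raw query. As a minimal sketch of the `read_sql_table` route, the example below builds a tiny in-memory SQLite table and then splits it into `X` and `Y` the same way. The toy rows and the assumption that the first four columns are `id`, `message`, `original` and `genre` are illustrative only; they are not taken from `DisasterResponse.db`.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database standing in for DisasterResponse.db (toy data only).
engine = create_engine("sqlite://")

# Assumed layout: four non-label columns followed by the category columns.
toy = pd.DataFrame({
    "id": [1, 2],
    "message": ["we need water", "roads are flooded"],
    "original": ["", ""],
    "genre": ["direct", "news"],
    "water": [1, 0],
    "floods": [0, 1],
})
toy.to_sql("Messages", engine, index=False)

# read_sql_table loads a whole table by name, with no SQL string needed.
df_toy = pd.read_sql_table("Messages", engine)

X_toy = df_toy["message"]     # feature: the raw message text
Y_toy = df_toy.iloc[:, 4:]    # targets: every column after the first four
print(Y_toy.columns.tolist())
```

Selecting the targets by position, as `df.iloc[:, 4:]` does, depends on the column order in the table; selecting by name is a more defensive alternative if the schema may change.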

### 2. Write a tokenization function to process your text data

In [5]:
def tokenize(text):
    # Convert text to lowercase and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", ' ', text.lower())
    # Tokenize
    words = word_tokenize(text)
    # Remove Stopwords
    stop_words = stopwords.words("english")
    words = [w for w in words if w not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    clean = [lemmatizer.lemmatize(w, pos='n').strip() for w in words]
    clean = [lemmatizer.lemmatize(w, pos='v').strip() for w in clean]
    
    
    return clean
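
To see what the normalisation and stop-word steps do without downloading any NLTK data, here is a simplified sketch using only the standard library. `simple_tokenize` and `TOY_STOP_WORDS` are hypothetical stand-ins, not part of the notebook: whitespace splitting replaces `word_tokenize`, a six-word set replaces NLTK's English stop-word list, and lemmatization is omitted.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (illustration only).
TOY_STOP_WORDS = {"the", "is", "a", "in", "we", "are"}


def simple_tokenize(text):
    """Lowercase, replace non-alphanumerics with spaces, split, drop stop words."""
    normalized = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [tok for tok in normalized.split() if tok not in TOY_STOP_WORDS]


tokens = simple_tokenize("We are trapped; the water is rising in Port-au-Prince!")
print(tokens)
# ['trapped', 'water', 'rising', 'port', 'au', 'prince']
```

Note that replacing punctuation with spaces splits hyphenated names such as "Port-au-Prince" into separate tokens; the `tokenize` function above behaves the same way because it uses the same regular expression.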


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 35 category columns in the dataset (see `Y.shape` above). You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [6]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
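
As a rough illustration of how `MultiOutputClassifier` behaves, the sketch below fits a small pipeline on six made-up messages with two made-up label columns. It uses `TfidfVectorizer`, which combines the `CountVectorizer` and `TfidfTransformer` steps, and the default tokenizer rather than `tokenize`; the data and names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy messages with two label columns, e.g. [water, food] (made up for illustration).
messages = [
    "we need drinking water",
    "no water in the camp",
    "people are hungry and need food",
    "send food and rice please",
    "we need water and food",
    "the road is blocked",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(messages, labels)

# One independent classifier is fitted per label column.
n_fitted = len(toy_pipeline.named_steps["clf"].estimators_)

# predict returns one row per message and one column per label.
pred = toy_pipeline.predict(["need water and food"])
print(n_fitted, pred.shape)
```

Because each label gets its own estimator, training time grows roughly linearly with the number of category columns.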

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 17)
pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                  

In [8]:
X_test

16750    According to Government estimates, as many as ...
10245    Craziness in Haiti in aftermath of major earth...
25610    A source close to the donors said the army bat...
15095    Objective 2 - to reduce morbidity and mortalit...
21085    The three people include; Mrs. Shallah Mahaind...
                               ...                        
2784     Do you have any mosquito net. Our house collap...
13419    NORTH India: isolated rain thundershowers over...
23092    In the context of the tsunami disaster, the ke...
12609    I didn't even have to drive. The wind blew my ...
19682    Triggered by conflict over the villages' sea b...
Name: message, Length: 6554, dtype: object
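
The test set length of 6554 follows from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows go to the test set. A quick check on row indices only, using the row count reported by `Y.shape` above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 26216  # row count reported by Y.shape above
indices = np.arange(n_rows)

# Default split: test_size=0.25 when neither size is specified.
train_idx, test_idx = train_test_split(indices, random_state=17)

print(len(train_idx), len(test_idx))
# 19662 6554
```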

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [None]:
def evaluate_model(actual, predicted, col_names):
    """Return a DataFrame of accuracy, precision, recall and F1 per output column."""
    metrics = []

    # Calculate evaluation metrics for each set of labels.
    # Note: with average="micro" on a single binary column, precision, recall
    # and F1 all equal accuracy, which is why the four columns in the output match.
    for i in range(actual.shape[1]):
        accuracy = accuracy_score(actual[:, i], predicted[:, i])
        precision = precision_score(actual[:, i], predicted[:, i], average="micro")
        recall = recall_score(actual[:, i], predicted[:, i], average="micro")
        f1 = f1_score(actual[:, i], predicted[:, i], average="micro")

        metrics.append([accuracy, precision, recall, f1])

    # Create dataframe containing metrics
    metrics = np.array(metrics)
    metrics_df = pd.DataFrame(data=metrics, index=col_names,
                              columns=['Accuracy', 'Precision', 'Recall', 'F1'])

    return metrics_df
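
The step description suggests iterating through the columns and calling `classification_report` on each. A minimal sketch of that loop on made-up arrays might look like this; `col_names_toy`, `actual_toy` and `predicted_toy` are hypothetical and stand in for `col_names`, `np.array(Y_test)` and the pipeline's predictions.

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up labels and predictions for two categories (illustration only).
col_names_toy = ["water", "food"]
actual_toy = np.array([[1, 0], [0, 0], [1, 1], [0, 1], [0, 0], [1, 0]])
predicted_toy = np.array([[1, 0], [0, 0], [0, 1], [0, 0], [0, 0], [1, 1]])

reports = {}
for i, name in enumerate(col_names_toy):
    # One report per output column; zero_division=0 avoids warnings
    # when a class is never predicted.
    reports[name] = classification_report(
        actual_toy[:, i], predicted_toy[:, i], zero_division=0
    )
    print(name)
    print(reports[name])
```

Unlike the micro-averaged numbers from `evaluate_model`, each report lists precision, recall and F1 separately for class 0 and class 1, which makes weak performance on the rare positive class visible.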

In [33]:
# Spot-check the metrics for a single column (index 1, `request`) on the training set.
# Y_train_pred is created in cell In [10] below, which was executed before this one.
predicted = Y_train_pred
actual = np.array(Y_train)
accuracy = accuracy_score(actual[:, 1], predicted[:, 1])
precision = precision_score(actual[:, 1], predicted[:, 1], average="micro")
recall = recall_score(actual[:, 1], predicted[:, 1], average="micro")
f1 = f1_score(actual[:, 1], predicted[:, 1], average="micro")
print(accuracy, precision, recall, f1)

0.9858101922490082 0.9858101922490082 0.9858101922490082 0.9858101922490082
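
The four identical numbers above are a property of micro averaging rather than of the model: for a single binary label, micro-averaged precision, recall and F1 all reduce to accuracy. The sketch below, on ten made-up labels, contrasts this with scikit-learn's default `average="binary"`, which scores the positive class only.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Ten made-up labels: two positives, one of which the model misses.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Micro averaging pools both classes, so all three scores equal accuracy.
micro = (
    precision_score(y_true, y_pred, average="micro"),
    recall_score(y_true, y_pred, average="micro"),
    f1_score(y_true, y_pred, average="micro"),
)

# The default average="binary" reports scores for the positive class only.
binary = (
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred),
)

print(accuracy, micro)  # accuracy 0.9; micro precision, recall and F1 are also 0.9
print(binary)           # precision 1.0, recall 0.5, F1 about 0.667
```

For rare categories, the positive-class scores are usually the more informative ones, since a model can reach high accuracy while missing most positives.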


In [10]:
# Calculate evaluation metrics for training set
Y_train_pred = pipeline.predict(X_train)
col_names = list(Y.columns.values)

print(evaluate_model(np.array(Y_train), Y_train_pred, col_names))

                        Accuracy  Precision    Recall        F1
related                 0.989167   0.989167  0.989167  0.989167
request                 0.985810   0.985810  0.985810  0.985810
offer                   0.999085   0.999085  0.999085  0.999085
aid_related             0.984844   0.984844  0.984844  0.984844
medical_help            0.987489   0.987489  0.987489  0.987489
medical_products        0.992676   0.992676  0.992676  0.992676
search_and_rescue       0.993999   0.993999  0.993999  0.993999
security                0.995880   0.995880  0.995880  0.995880
military                0.996135   0.996135  0.996135  0.996135
water                   0.995117   0.995117  0.995117  0.995117
food                    0.994304   0.994304  0.994304  0.994304
shelter                 0.991608   0.991608  0.991608  0.991608
clothing                0.997152   0.997152  0.997152  0.997152
money                   0.995219   0.995219  0.995219  0.995219
missing_people          0.997813   0.997

In [11]:
# Calculate evaluation metrics for test set
Y_test_pred = pipeline.predict(X_test)

testing_metrics = evaluate_model(np.array(Y_test), Y_test_pred, col_names)
print(testing_metrics)

                        Accuracy  Precision    Recall        F1
related                 0.810650   0.810650  0.810650  0.810650
request                 0.883125   0.883125  0.883125  0.883125
offer                   0.995423   0.995423  0.995423  0.995423
aid_related             0.752212   0.752212  0.752212  0.752212
medical_help            0.924321   0.924321  0.924321  0.924321
medical_products        0.949344   0.949344  0.949344  0.949344
search_and_rescue       0.972994   0.972994  0.972994  0.972994
security                0.981996   0.981996  0.981996  0.981996
military                0.967501   0.967501  0.967501  0.967501
water                   0.954379   0.954379  0.954379  0.954379
food                    0.937443   0.937443  0.937443  0.937443
shelter                 0.924779   0.924779  0.924779  0.924779
clothing                0.983674   0.983674  0.983674  0.983674
money                   0.977724   0.977724  0.977724  0.977724
missing_people          0.989472   0.989

In [12]:
# Calculate the proportion of rows in each column that have label == 1
Y.sum()/len(Y)

related                   0.773650
request                   0.170659
offer                     0.004501
aid_related               0.414251
medical_help              0.079493
medical_products          0.050084
search_and_rescue         0.027617
security                  0.017966
military                  0.032804
water                     0.063778
food                      0.111497
shelter                   0.088267
clothing                  0.015449
money                     0.023039
missing_people            0.011367
refugees                  0.033377
death                     0.045545
other_aid                 0.131446
infrastructure_related    0.065037
transport                 0.045812
buildings                 0.050847
electricity               0.020293
tools                     0.006065
hospitals                 0.010795
shops                     0.004577
aid_centers               0.011787
other_infrastructure      0.043904
weather_related           0.278341
floods              
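
These proportions give a useful reference point for the accuracy figures: a classifier that always predicts the majority class already scores `max(p, 1 - p)` on a column with positive rate `p`. The sketch below computes that baseline for four of the proportions listed above; `majority_baseline_accuracy` is a hypothetical helper, and the baseline refers to the full dataset rather than the test split.

```python
# Positive-label proportions copied from the output above.
positive_rate = {
    "related": 0.773650,
    "request": 0.170659,
    "offer": 0.004501,
    "aid_related": 0.414251,
}


def majority_baseline_accuracy(p):
    """Accuracy of always predicting the more frequent class."""
    return max(p, 1.0 - p)


baseline = {name: majority_baseline_accuracy(p) for name, p in positive_rate.items()}
for name, acc in baseline.items():
    print(f"{name:12s} {acc:.6f}")
```

For `offer`, the baseline of about 0.9955 is very close to the test accuracy reported earlier (0.995423), which suggests that accuracy alone says little about how well the rare categories are detected.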

### 6. Improve your model
Use grid search to find better parameters. 

In [13]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x1462c0bf8>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                          class_weight=None,
                                                          criterion='gini',
                                                          max_depth=None,
                                                          max_features='auto',
            

In [14]:
# Define performance metric for use in grid search scoring object
def performance_metric(y_true, y_pred):
    """Calculate median F1 score for all of the output classifiers
    
    Args:
    y_true: array. Array containing actual labels.
    y_pred: array. Array containing predicted labels.
        
    Returns:
    score: float. Median F1 score for all of the output classifiers
    """
    f1_list = []
    for i in range(np.shape(y_pred)[1]):
        f1 = f1_score(np.array(y_true)[:, i], y_pred[:, i],average='micro')
        f1_list.append(f1)
        
    score = np.median(f1_list)
    return score

In [39]:
parameters = {
    # 'vect__min_df': [1, 5],
    # 'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [10, 50],
    'clf__estimator__min_samples_split': [2, 4],
}
# scorer = make_scorer(fbeta_score, beta=2)
# scorer = make_scorer(performance_metric)
cv = GridSearchCV(
    pipeline,
    param_grid=parameters,
    # scoring=scorer,
    verbose=10,
)
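
The parameter names follow the `step__parameter` convention: `clf__estimator__n_estimators` reaches the `n_estimators` of the estimator wrapped by the `clf` step. As a small, hedged sketch of a grid search with a custom scorer, the example below uses synthetic multilabel data and a median positive-class F1; `median_binary_f1` is a hypothetical variant of `performance_metric` that uses the default binary averaging, not the notebook's own function.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def median_binary_f1(y_true, y_pred):
    """Median positive-class F1 across output columns (hypothetical scorer)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = [
        f1_score(y_true[:, i], y_pred[:, i], zero_division=0)
        for i in range(y_true.shape[1])
    ]
    return float(np.median(scores))


# Synthetic data standing in for the TF-IDF features and category labels.
X_toy, Y_toy = make_multilabel_classification(
    n_samples=120, n_features=10, n_classes=3, random_state=0
)

toy_pipeline = Pipeline([
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])
param_grid = {
    "clf__estimator__n_estimators": [5, 10],
    "clf__estimator__min_samples_split": [2, 4],
}
search = GridSearchCV(
    toy_pipeline,
    param_grid=param_grid,
    scoring=make_scorer(median_binary_f1),
    cv=3,
)
search.fit(X_toy, Y_toy)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # 2 x 2 = 4 candidate settings
```

Without a `scoring` argument, `GridSearchCV` falls back to the estimator's own `score` method, so passing an explicit scorer is what makes the search optimise the metric of interest.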

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [40]:
# Run Grid Search

# DataFrame.as_matrix() was removed in pandas 1.0; .values works across versions
cv.fit(X_train.values, Y_train.values)
# y_pred = cv.predict(X_test)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, score=0.236, total=  28.1s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   28.2s remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, score=0.229, total=  27.8s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   56.0s remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, score=0.243, total=  28.6s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.4min remaining:    0.0s


KeyboardInterrupt: 

In [21]:
# Get results of grid search
cv.cv_results_

{'mean_fit_time': array([22.80210129, 86.10871808, 20.88734436, 72.27782631]),
 'std_fit_time': array([0.22460763, 3.12385852, 0.04054144, 5.05154817]),
 'mean_score_time': array([4.78515164, 9.46545712, 5.17601093, 9.43020558]),
 'std_score_time': array([0.05570086, 0.33590362, 0.14233711, 0.18329507]),
 'param_clf__estimator__min_samples_split': masked_array(data=[2, 2, 4, 4],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_clf__estimator__n_estimators': masked_array(data=[10, 50, 10, 50],
              mask=[False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'clf__estimator__min_samples_split': 2,
   'clf__estimator__n_estimators': 10},
  {'clf__estimator__min_samples_split': 2, 'clf__estimator__n_estimators': 50},
  {'clf__estimator__min_samples_split': 4, 'clf__estimator__n_estimators': 10},
  {'clf__estimator__min_samples_split': 4,
   'clf__estimator__n_estimators': 50}],
 'sp

In [22]:
# Best mean test score
np.max(cv.cv_results_['mean_test_score'])

0.9576340148509815

In [23]:
# Parameters for best mean test score
cv.best_params_

{'clf__estimator__min_samples_split': 4, 'clf__estimator__n_estimators': 50}

In [24]:
cv.best_estimator_

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                  

In [29]:
# Calculate evaluation metrics for test set
tuned_pred_test = cv.predict(X_test)

tuned_metrix = evaluate_model(np.array(Y_test), tuned_pred_test, col_names)

print(tuned_metrix)

                        Accuracy  Precision    Recall        F1
related                 0.821483   0.821483  0.821483  0.821483
request                 0.894721   0.894721  0.894721  0.894721
offer                   0.995423   0.995423  0.995423  0.995423
aid_related             0.791120   0.791120  0.791120  0.791120
medical_help            0.925542   0.925542  0.925542  0.925542
medical_products        0.949496   0.949496  0.949496  0.949496
search_and_rescue       0.973756   0.973756  0.973756  0.973756
security                0.982453   0.982453  0.982453  0.982453
military                0.968569   0.968569  0.968569  0.968569
water                   0.957278   0.957278  0.957278  0.957278
food                    0.938206   0.938206  0.938206  0.938206
shelter                 0.933323   0.933323  0.933323  0.933323
clothing                0.984132   0.984132  0.984132  0.984132
money                   0.977266   0.977266  0.977266  0.977266
missing_people          0.989472   0.989

In [30]:
# Get summary stats for first model
testing_metrics.describe()

        Accuracy  Precision     Recall         F1
count       35.0       35.0       35.0       35.0
mean     0.94287    0.94287    0.94287    0.94287
std     0.056552   0.056552   0.056552   0.056552
min     0.752212   0.752212   0.752212   0.752212
25%     0.935383   0.935383   0.935383   0.935383
50%     0.956973   0.956973   0.956973   0.956973
75%     0.982148   0.982148   0.982148   0.982148
max     0.996186   0.996186   0.996186   0.996186


In [31]:
# Get summary stats for tuned model
tuned_metrix.describe()

        Accuracy  Precision     Recall         F1
count       35.0       35.0       35.0       35.0
mean    0.946876   0.946876   0.946876   0.946876
std     0.050225   0.050225   0.050225   0.050225
min      0.79112    0.79112    0.79112    0.79112
25%     0.936909   0.936909   0.936909   0.936909
50%     0.957278   0.957278   0.957278   0.957278
75%     0.983293   0.983293   0.983293   0.983293
max     0.996033   0.996033   0.996033   0.996033


In [None]:
Tuning the model parameters has increased both the mean (0.9429 to 0.9469) and the median (0.9570 to 0.9573) 
test F1 score for the output classifiers. However, it is still the case that 50% of the output classifiers have 
an F1 score of less than 0.24, and 25% have an F1 score of less than 0.064. These two figures do not appear in 
the micro-averaged tables above, which equal accuracy for a binary label; they presumably refer to F1 on the 
positive class. This is due to low recall values (i.e. the proportion of positive points that were correctly 
labelled). Ideally, we would like to try to improve on this.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
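
One common way to add features besides TF-IDF is to combine several transformers with `FeatureUnion`. The sketch below appends a message-length column to the TF-IDF matrix. `TextLengthExtractor` is a hypothetical transformer written for this illustration, and message length is only an example feature, not one the notebook evaluates.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding the character count of each message."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)


features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", TextLengthExtractor()),
])

messages = ["we need water", "road blocked by flood water near the bridge"]
matrix = features.fit_transform(messages)

# The union stacks the TF-IDF columns and the length column side by side.
dense = matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)
print(dense.shape)    # 10 vocabulary terms + 1 length column
print(dense[:, -1])   # character counts of the two messages
```

In a full pipeline, the `FeatureUnion` would take the place of the `vect` and `tfidf` steps, with the classifier step left unchanged.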

In [None]:
# Try an AdaBoost classifier (boosted decision stumps by default) in place of the random forest
moc = MultiOutputClassifier(AdaBoostClassifier(n_estimators=100))

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', moc)
    ])

X_train, X_test, y_train, y_test = train_test_split(X, Y)
pipeline.fit(X_train.values, y_train.values)
y_pred = pipeline.predict(X_test)
# get_results is not defined in this notebook; evaluate_model from step 5 is used instead
results = evaluate_model(np.array(y_test), y_pred, col_names)
results

### 9. Export your model as a pickle file

In [35]:
filename = 'classifier.pkl'
with open(filename, 'wb') as f:
    pickle.dump(pipeline, f)

filename = 'classifier_tuned.pkl'
with open(filename, 'wb') as f:
    pickle.dump(cv, f)
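
To check that an exported model can be used again, it helps to load it back and compare predictions. The sketch below does a round trip with a tiny decision tree in a temporary directory; the toy model and the file name `toy_classifier.pkl` are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in model (illustration only).
X_toy = [[0], [1], [2], [3]]
y_toy = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_classifier.pkl")

    # Context managers ensure the file is closed after writing and reading.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same_predictions = list(restored.predict(X_toy)) == list(model.predict(X_toy))
print(same_predictions)
```

Pickle stores functions by reference rather than by value, so a pipeline that uses a custom `tokenize` function can only be loaded where that function is importable under the same module and name.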

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.