# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
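As a minimal sketch of the feature/target split, the toy frame below mimics the layout assumed for the `Messages` table: an `id`, the `message` text, the `original` text, and one 0/1 column per category. The toy rows and the two category columns are made up for illustration; only the column roles follow the notebook.

```python
import pandas as pd

# Toy stand-in for the Messages table (values are invented for illustration)
toy = pd.DataFrame({
    'id': [1, 2],
    'message': ['we need water', 'road is blocked'],
    'original': [None, None],
    'related': [1, 1],
    'water': [1, 0],
})

# Feature: the raw message text
X_toy = toy['message']

# Targets: every remaining column once the non-label columns are dropped
y_toy = toy.drop(columns=['id', 'message', 'original'])

print(X_toy.shape, y_toy.shape)
print(list(y_toy.columns))
```

Dropping the non-label columns, rather than listing the labels by name, keeps the split valid if categories are added to the table later.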

In [400]:
# import libraries
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import re
import pickle

!pip install termcolor
from termcolor import colored, cprint

import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk import pos_tag
from nltk.corpus import stopwords

from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier

import warnings

warnings.simplefilter('ignore')



[nltk_data] Downloading package punkt to /Users/Saleh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/Saleh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/Saleh/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [210]:
# load data from database

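def load_data():
    # read the Messages table from the SQLite database
    engine = create_engine('sqlite:///disaster_messages.db')
    df = pd.read_sql_table('Messages', engine)
    X = df.message
    y = df.drop(['id', 'message', 'original'], axis=1)
    return X, y

# load the data so that X and y exist before sampling from them
X, y = load_data()
print(X.sample(1))
y.sample(1)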

24371    With this generous donation confirmed we will ...
Name: message, dtype: object


Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report,direct,news
24115,1,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


---
### 2. Write a tokenization function to process your text data

---

In [211]:
def tokenize(text):
    # Normalize text
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())

    # tokenize text
    tokens = word_tokenize(text)
    
    # initiate lemmatizer
    lemmatizer = WordNetLemmatizer()

    # iterate through each token
    clean_tokens = []
    for tok in tokens:
        
        # lemmatize, normalize case, and remove leading/trailing white space
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
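The normalization step can be checked in isolation. The sketch below applies the same regular expression as `tokenize` and then splits on whitespace; `str.split()` is only a stand-in for `word_tokenize`, and the sample sentence is invented.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())

sample = "Water needed in Port-au-Prince, urgently!!"
normalized = normalize(sample)

# str.split() stands in for word_tokenize here (an assumption for illustration)
tokens = normalized.split()
print(tokens)
```

Because punctuation becomes whitespace, hyphenated names such as "Port-au-Prince" end up as three separate tokens, which may or may not be desirable for this dataset.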

---
### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
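To see what `MultiOutputClassifier` does with several targets, here is a small sketch on made-up numeric data with two label columns. The toy arrays and the `n_estimators=10, random_state=0` settings are assumptions chosen only to keep the example fast and repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data: 6 samples, 2 features, 2 binary target columns (invented values)
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])
Y_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]])

clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X_toy, Y_toy)

# One fitted estimator per target column
n_fitted = len(clf.estimators_)

# Predictions come back as one column per target
pred = clf.predict(X_toy)
print(n_fitted, pred.shape)
```

With 36 categories, the same wrapper would fit 36 independent classifiers, so training time grows roughly linearly with the number of targets.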

---

In [441]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)), 
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())) # this strategy consists of fitting one classifier per target.
])

---
### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
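A minimal sketch of the split on made-up data follows. It sets `test_size` and `random_state` explicitly, which the notebook cell does not do; with the defaults, scikit-learn holds out 25% of the rows and draws a different split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 100 samples with one feature and a binary label
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100) % 2

# Hold out 25% of the rows; the fixed seed makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

# Calling it again with the same seed returns the same partition
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42)

print(len(X_tr), len(X_te))
```

Fixing the seed is worth considering when comparing models, since otherwise differences in scores may partly reflect a different test set.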

---

In [443]:
X, y = load_data()
X_train, X_test, y_train, y_test = train_test_split(X, y)
 
# train classifier
model = pipeline.fit(X_train, y_train)
model

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip..._score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=None))])

---
### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each. [Source for using classification_report](http://joshlawman.com/metrics-classification-report-breakdown-precision-recall-f1/)

---

In [504]:
def classifaction_report(model, X_test, y_test):
    # predict on test data
    y_pred = model.predict(X_test)
    
    # list the target columns (taken from y_test rather than the global y)
    columns = list(y_test.columns)
    for i, col in enumerate(columns):
        target_names = [col+'_0', col+'_1']
        print ('\nClassification Report for the column ({}):\n'.format(colored(col, 'blue', attrs=['bold', 'underline'])), 
               classification_report(y_test[col],
                                     y_pred[:, i],
                                     target_names=target_names))

In [505]:
classifaction_report_1 = classifaction_report(model, X_test, y_test)
classifaction_report_1


Classification Report for the column (related):
               precision    recall  f1-score   support

   related_0       0.62      0.36      0.46      1483
   related_1       0.83      0.94      0.88      5024

   micro avg       0.80      0.80      0.80      6507
   macro avg       0.73      0.65      0.67      6507
weighted avg       0.78      0.80      0.78      6507


Classification Report for the column (request):
               precision    recall  f1-score   support

   request_0       0.88      0.98      0.93      5362
   request_1       0.84      0.39      0.53      1145

   micro avg       0.88      0.88      0.88      6507
   macro avg       0.86      0.69      0.73      6507
weighted avg       0.88      0.88      0.86      6507


Classification Report for the column (offer):
               precision    recall  f1-score   support

     offer_0       1.00      1.00      1.00      6476
     offer_1       0.00      0.00     

In [533]:
def scores_report(model, X_test, y_test):
    # predict on test data
    y_pred = model.predict(X_test)
    
    # list the target columns (taken from y_test rather than the global y)
    columns = list(y_test.columns)
    for i, col in enumerate(columns):
        precision, recall, fscore, support = precision_recall_fscore_support(y_test[col],
                                                                    y_pred[:, i],
                                                                    average='weighted')
        
        print('\nReport for the column ({}):\n'.format(colored(col, 'blue', attrs=['bold', 'underline'])))
        
        if precision >= 0.75:
            print('Precision: {}'.format(colored(round(precision, 2), 'green')))
        else:
            print('Precision: {}'.format(colored(round(precision, 2), 'red')))
            
        if recall >= 0.75:
            print('Recall: {}'.format(colored(round(recall, 2), 'green')))
        else:
            print('Recall: {}'.format(colored(round(recall, 2), 'red')))
            
        if fscore >= 0.75:
            print('F-score: {}'.format(colored(round(fscore, 2), 'green')))
        else:
            print('F-score: {}'.format(colored(round(fscore, 2), 'red')))
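The `average='weighted'` setting weights each class by its support, so a rare positive class contributes little to the score. In the classification report the `offer_1` row shows 0.00 precision and recall while the weighted figures for `offer` are about 0.99. The sketch below reproduces that effect on invented labels; `zero_division=0` is assumed to be available (scikit-learn 0.22 or later) and only silences the undefined-precision warning.

```python
from sklearn.metrics import precision_recall_fscore_support

# Invented labels: 9 negatives, 1 positive; the model predicts "0" for everything
y_true = [0] * 9 + [1]
y_pred = [0] * 10

# Support-weighted average over both classes
w_precision, w_recall, w_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted', zero_division=0)

# Scores for the positive class only
p_precision, p_recall, p_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='binary', zero_division=0)

print('weighted recall:', round(w_recall, 2))
print('positive-class recall:', round(p_recall, 2))
```

For imbalanced categories it may therefore be more informative to report the positive-class scores alongside the weighted ones.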

In [525]:
# store the result under a new name so the scores_report function is not overwritten
scores_report_1 = scores_report(model, X_test, y_test)
scores_report_1


Report for the column (related):

Precision: 0.78
Recall: 0.8
F-score: 0.78

Report for the column (request):

Precision: 0.88
Recall: 0.88
F-score: 0.86

Report for the column (offer):

Precision: 0.99
Recall: 1.0
F-score: 0.99

Report for the column (aid_related):

Precision: 0.74
Recall: 0.74
F-score: 0.74

Report for the column (medical_help):

Precision: 0.89
Recall: 0.92
F-score: 0.89

Report for the column (medical_products):

Precision: 0.93
Recall: 0.95
F-score: 0.93

Report for the column (search_and_rescue):

Precision: 0.97
Recall: 0.98
F-score: 0.97

Report for the column (security):

Precision: 0.97
Recall: 0.98
F-score: 0.98

---
### 6. Improve your model
Use grid search to find better parameters. 

---

In [449]:
model.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x1a33edb510>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
               oob_score=False, random_state=

In [452]:
# specify parameters for grid search
parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf': (True, False), 
              'clf__estimator__n_estimators': [10, 25], 
              'clf__estimator__min_samples_split': [2, 4]}

# create grid search object
cv = GridSearchCV(model, param_grid=parameters)
model_2 = cv.fit(X_train, y_train)
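The double-underscore keys route each value to a step of the pipeline, and `clf__estimator__...` reaches through `MultiOutputClassifier` to the wrapped forest. The sketch below rebuilds an unfitted copy of the pipeline (named `demo_pipeline` here, without the custom tokenizer) to count the candidate settings and to show the routing; it does not run the search itself.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import ParameterGrid
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Unfitted stand-in for the notebook's pipeline (default tokenizer, for illustration)
demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf': (True, False),
              'clf__estimator__n_estimators': [10, 25],
              'clf__estimator__min_samples_split': [2, 4]}

# Every combination of the listed values is one candidate: 2 * 2 * 2 * 2
n_candidates = len(ParameterGrid(parameters))

# The same key syntax works with set_params / get_params
demo_pipeline.set_params(clf__estimator__n_estimators=25)
n_trees = demo_pipeline.get_params()['clf__estimator__n_estimators']

print(n_candidates, n_trees)
```

Each candidate is fitted once per cross-validation fold, so the total number of fits is the candidate count multiplied by the number of folds, plus one final refit on the full training set.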

---
### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

---

In [503]:
classifaction_report_2 = classifaction_report(model_2, X_test, y_test)
classifaction_report_2


Classification Report for the column (related):
               precision    recall  f1-score   support

   related_0       0.72      0.33      0.45      1483
   related_1       0.83      0.96      0.89      5024

   micro avg       0.82      0.82      0.82      6507
   macro avg       0.77      0.64      0.67      6507
weighted avg       0.80      0.82      0.79      6507


Classification Report for the column (request):
               precision    recall  f1-score   support

   request_0       0.90      0.98      0.94      5362
   request_1       0.85      0.46      0.60      1145

   micro avg       0.89      0.89      0.89      6507
   macro avg       0.87      0.72      0.77      6507
weighted avg       0.89      0.89      0.88      6507


Classification Report for the column (offer):
               precision    recall  f1-score   support

     offer_0       1.00      1.00      1.00      6476
     offer_1       0.00      0.00     

In [524]:
scores_report_2 = scores_report(model_2, X_test, y_test)
scores_report_2


Report for the column (related):

Precision: 0.8
Recall: 0.82
F-score: 0.79

Report for the column (request):

Precision: 0.89
Recall: 0.89
F-score: 0.88

Report for the column (offer):

Precision: 0.99
Recall: 0.99
F-score: 0.99

Report for the column (aid_related):

Precision: 0.76
Recall: 0.76
F-score: 0.76

Report for the column (medical_help):

Precision: 0.89
Recall: 0.92
F-score: 0.89

Report for the column (medical_products):

Precision: 0.94
Recall: 0.95
F-score: 0.94

Report for the column (search_and_rescue):

Precision: 0.97
Recall: 0.98
F-score: 0.97

Report for the column (security):

Precision: 0.97
Recall: 0.98
F-score: 0.98

In [463]:
model_2.best_params_

{'clf__estimator__min_samples_split': 4,
 'clf__estimator__n_estimators': 25,
 'tfidf__use_idf': True,
 'vect__min_df': 5}

---
### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
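One way to add a feature besides TF-IDF is to combine the text branch with a custom transformer through `FeatureUnion`. The sketch below is only an illustration: `MessageLength` is a hypothetical transformer invented here, the two sample messages are made up, and the default `CountVectorizer` tokenizer is used instead of `tokenize`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer works inside a pipeline
        return self

    def transform(self, X):
        # One column holding the character count of every message
        return np.array([[len(text)] for text in X])

# Stack the TF-IDF columns and the length column side by side
features = FeatureUnion([
    ('text', Pipeline([('vect', CountVectorizer()),
                       ('tfidf', TfidfTransformer())])),
    ('length', MessageLength()),
])

docs = ['we need water', 'storm damaged the bridge near town']
combined = features.fit_transform(docs)
print(combined.shape)
```

The resulting `features` object could replace the `vect` and `tfidf` steps of a pipeline, with the classifier step left unchanged. Whether a length feature helps on this dataset would need to be checked with the reports above.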

---

In [528]:
pipeline_2 = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])

pipeline_2.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=None, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
             learning_rate=1.0, n_estimators=50, random_state=None),
              n_jobs=None))],
 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), preprocessor=None, s

In [529]:
# train classifier
model_3 = pipeline_2.fit(X_train, y_train)
model_3

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...or=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=None))])

In [530]:
classifaction_report_3 = classifaction_report(model_3, X_test, y_test)
classifaction_report_3


Classification Report for the column (related):
               precision    recall  f1-score   support

   related_0       0.68      0.25      0.37      1483
   related_1       0.81      0.97      0.88      5024

   micro avg       0.80      0.80      0.80      6507
   macro avg       0.75      0.61      0.63      6507
weighted avg       0.78      0.80      0.77      6507


Classification Report for the column (request):
               precision    recall  f1-score   support

   request_0       0.91      0.97      0.94      5362
   request_1       0.78      0.54      0.64      1145

   micro avg       0.89      0.89      0.89      6507
   macro avg       0.85      0.76      0.79      6507
weighted avg       0.89      0.89      0.89      6507


Classification Report for the column (offer):
               precision    recall  f1-score   support

     offer_0       1.00      1.00      1.00      6476
     offer_1       0.00      0.00     

In [534]:
scores_report_3 = scores_report(model_3, X_test, y_test)
scores_report_3


Report for the column (related):

Precision: 0.78
Recall: 0.8
F-score: 0.77

Report for the column (request):

Precision: 0.89
Recall: 0.89
F-score: 0.89

Report for the column (offer):

Precision: 0.99
Recall: 0.99
F-score: 0.99

Report for the column (aid_related):

Precision: 0.75
Recall: 0.75
F-score: 0.75

Report for the column (medical_help):

Precision: 0.91
Recall: 0.93
F-score: 0.91

Report for the column (medical_products):

Precision: 0.95
Recall: 0.96
F-score: 0.95

Report for the column (search_and_rescue):

Precision: 0.97
Recall: 0.98
F-score: 0.97

Report for the column (security):

Precision: 0.97
Recall: 0.98
F-score: 0.98

### 9. Export your model as a pickle file

In [535]:
f = 'model.pkl'

# write the tuned model to disk; the with-block closes the file automatically
with open(f, 'wb') as model_pkl:
    pickle.dump(model_2, model_pkl)
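A quick way to confirm that an export worked is to load the file back and compare. The sketch below uses a plain dictionary as a stand-in for the fitted model and writes to a temporary directory, both assumptions made so that it runs without the trained pipeline.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted model
stand_in_model = {'best_params': {'vect__min_df': 5}, 'name': 'demo'}

# Write to a temporary location so no existing model.pkl is overwritten
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

# Context managers close the file even if dump or load raises
with open(path, 'wb') as fh:
    pickle.dump(stand_in_model, fh)

with open(path, 'rb') as fh:
    restored = pickle.load(fh)

print(restored == stand_in_model)
```

When loading a pickled pipeline that uses a custom tokenizer, the `tokenize` function generally has to be importable under the same name in the loading script, since pickle stores a reference to it rather than its code.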

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.