# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import pickle

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///disaster.db')
df = pd.read_sql_table('disaster',con=engine)
# Treat every column from the fifth onward as a category label
categories = df.columns[4:]
# X holds the raw message strings; y holds the 0/1 labels for each category column
X = df['message'].values
y = df[categories].values
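
To make the column slicing concrete, here is a small sketch on a toy frame. Only the `message` column name comes from the notebook; the other column names (`id`, `original`, `genre`, `related`, `request`) and all rows are assumptions for illustration, and the real table may look different.

```python
import pandas as pd

# Toy stand-in for the table; every column except `message` is hypothetical.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["need water", "roads blocked", "thank you"],
    "original": [None, None, None],
    "genre": ["direct", "news", "social"],
    "related": [1, 1, 0],
    "request": [1, 0, 0],
})

# Everything from the fifth column onward is treated as a target category.
toy_categories = toy.columns[4:]

# 1-D array of message strings and a 2-D array of 0/1 labels.
toy_X = toy["message"].values
toy_y = toy[toy_categories].values

print(list(toy_categories))
print(toy_X.shape, toy_y.shape)
```

The shapes are worth checking on the real data too: `X` should be one-dimensional (one string per row) and `y` should have one column per category.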

### 2. Write a tokenization function to process your text data

In [3]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # tokenize text
    tokens = word_tokenize(text) 
    stop_words = stopwords.words("english")
    lemmatizer = WordNetLemmatizer()
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return tokens
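
The normalization step can be checked on its own without any NLTK data. The sketch below reuses the same regular expression as `tokenize`; the helper name `normalize`, the sample sentence, and the plain `str.split` (standing in for `word_tokenize`) are assumptions for illustration.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

sample = "Water needed in Port-au-Prince, ASAP!"

# Whitespace split is a simplified stand-in for word_tokenize.
words = normalize(sample).split()
print(words)
```

Note that hyphenated names are split into separate tokens, which may or may not be what you want for place names.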



### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:
def model_pipeline():
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ]))
        ])),

        ('clf', MultiOutputClassifier(RandomForestClassifier())),
        ])
    return pipeline
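
As a rough sanity check of what such a pipeline returns, the sketch below fits the same three stages on a handful of made-up messages with two made-up label columns. The toy data, the default tokenizer, and the fixed `random_state` are assumptions for illustration, not part of the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical messages with two hypothetical label columns: [water, shelter].
toy_messages = [
    "we need water and food",
    "house destroyed need shelter",
    "need water urgently",
    "no shelter after storm",
    "food and water please",
    "shelter collapsed",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),    # default tokenizer instead of tokenize()
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

# One row per message, one column per label.
toy_pred = toy_pipeline.predict(["please send water", "we lost our shelter"])
print(toy_pred.shape)
```

`MultiOutputClassifier` fits one independent classifier per label column, so the prediction array always has the same number of columns as the label array used for fitting.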


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
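
When `train_test_split` is called with no size arguments, it holds out 25% of the rows, and without a `random_state` the split changes on every run. If reproducible metrics matter, a seed can be fixed. The sketch below uses made-up data, and the seed value 42 is an arbitrary assumption.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the messages and a single label column.
toy_X = ["msg %d" % i for i in range(8)]
toy_y = [i % 2 for i in range(8)]

# Same seed twice: both calls should produce the same partition.
first = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)
second = train_test_split(toy_X, toy_y, test_size=0.25, random_state=42)

X_tr, X_te, y_tr, y_te = first
print(len(X_tr), len(X_te))
print(first == second)
```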

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = model_pipeline()
model.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [6]:
model.get_params()

{'memory': None, 'steps': [('features', FeatureUnion(n_jobs=1,
          transformer_list=[('text_pipeline', Pipeline(memory=None,
        steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip...y=None)), ('tfidf', TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True))]))],
          transformer_weights=None)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
              

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [7]:
def multioutput_classification_report(y_true, y_pred):
    """Print accuracy and support-weighted precision, recall, and F1 for each category column."""
    for i in range(len(categories)):
        print(categories[i] + ":")
        print("\tAccuracy: {:.4f}\tPrecision: {:.4f}\tRecall: {:.4f}\tF1_score: {:.4f}".format(
            accuracy_score(y_true[:, i], y_pred[:, i]),
            precision_score(y_true[:, i], y_pred[:, i], average='weighted'),
            recall_score(y_true[:, i], y_pred[:, i], average='weighted'),
            f1_score(y_true[:, i], y_pred[:, i], average='weighted')
        ))
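
The helper above passes `average='weighted'`, which averages per-class scores weighted by class support. One consequence is that weighted recall is algebraically equal to accuracy, which is why the Recall and Accuracy columns in this notebook's reports print identical values. The pure-Python sketch below illustrates this on made-up labels; the function names and the data are assumptions for illustration.

```python
from collections import Counter

def weighted_recall(y_true, y_pred):
    """Recall per class, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    n = len(y_true)
    total = 0.0
    for label, count in support.items():
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        total += (count / n) * (true_pos / count)
    return total

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Imbalanced toy labels: eight negatives, two positives.
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

# Recall of the positive class alone, for comparison.
positives = [p for t, p in zip(toy_true, toy_pred) if t == 1]
positive_recall = sum(positives) / len(positives)

print(accuracy(toy_true, toy_pred), weighted_recall(toy_true, toy_pred), positive_recall)
```

On imbalanced categories the weighted numbers are dominated by the majority class, so the recall of the positive class on its own can be much lower than the weighted figure suggests.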

In [8]:
# Predict categories for the training messages and compare them with the training labels
y_pred = model.predict(X_train)

multioutput_classification_report(y_train, y_pred)

related:
	Accuracy: 0.9905	Precision: 0.9905	Recall: 0.9905	F1_score: 0.9905
request:
	Accuracy: 0.9870	Precision: 0.9872	Recall: 0.9870	F1_score: 0.9869
offer:
	Accuracy: 0.9988	Precision: 0.9988	Recall: 0.9988	F1_score: 0.9987
aid_related:
	Accuracy: 0.9852	Precision: 0.9854	Recall: 0.9852	F1_score: 0.9852
medical_help:
	Accuracy: 0.9886	Precision: 0.9887	Recall: 0.9886	F1_score: 0.9882
medical_products:
	Accuracy: 0.9919	Precision: 0.9919	Recall: 0.9919	F1_score: 0.9915
search_and_rescue:
	Accuracy: 0.9934	Precision: 0.9934	Recall: 0.9934	F1_score: 0.9930
security:
	Accuracy: 0.9957	Precision: 0.9957	Recall: 0.9957	F1_score: 0.9955
military:
	Accuracy: 0.9952	Precision: 0.9952	Recall: 0.9952	F1_score: 0.9950
water:
	Accuracy: 0.9942	Precision: 0.9942	Recall: 0.9942	F1_score: 0.9941
food:
	Accuracy: 0.9946	Precision: 0.9946	Recall: 0.9946	F1_score: 0.9946
shelter:
	Accuracy: 0.9914	Precision: 0.9915	Recall: 0.9914	F1_score: 0.9912
clothing:
	Accuracy: 0.9976	Precision: 0.9976	Recall:

In [9]:
# Predict categories for the held-out test messages and compare them with the test labels
y_pred = model.predict(X_test)

multioutput_classification_report(y_test, y_pred)

related:
	Accuracy: 0.8071	Precision: 0.7945	Recall: 0.8071	F1_score: 0.7976
request:
	Accuracy: 0.8772	Precision: 0.8690	Recall: 0.8772	F1_score: 0.8597
offer:
	Accuracy: 0.9960	Precision: 0.9920	Recall: 0.9960	F1_score: 0.9940
aid_related:
	Accuracy: 0.7460	Precision: 0.7454	Recall: 0.7460	F1_score: 0.7402
medical_help:
	Accuracy: 0.9242	Precision: 0.9009	Recall: 0.9242	F1_score: 0.8998
medical_products:
	Accuracy: 0.9519	Precision: 0.9403	Recall: 0.9519	F1_score: 0.9342
search_and_rescue:
	Accuracy: 0.9740	Precision: 0.9648	Recall: 0.9740	F1_score: 0.9626
security:
	Accuracy: 0.9813	Precision: 0.9695	Recall: 0.9813	F1_score: 0.9724
military:
	Accuracy: 0.9651	Precision: 0.9532	Recall: 0.9651	F1_score: 0.9507
water:
	Accuracy: 0.9524	Precision: 0.9464	Recall: 0.9524	F1_score: 0.9418
food:
	Accuracy: 0.9370	Precision: 0.9320	Recall: 0.9370	F1_score: 0.9321
shelter:
	Accuracy: 0.9399	Precision: 0.9347	Recall: 0.9399	F1_score: 0.9300
clothing:
	Accuracy: 0.9872	Precision: 0.9846	Recall:



### 6. Improve your model
Use grid search to find better parameters. 

In [10]:
def model_pipeline2():
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ]))
        ])),

        ('clf', MultiOutputClassifier(RandomForestClassifier()))
        ])
    
    parameters = {
        'features__text_pipeline__tfidf__use_idf': (True, False),
        'clf__estimator__n_estimators': [20,30]
        }
    cv = GridSearchCV(pipeline, param_grid=parameters)
    return cv
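
Each key in the parameter grid is a path through the pipeline, with step names joined by double underscores (`features` → `text_pipeline` → `tfidf` → `use_idf`). Grid search tries every combination of the listed values. The standard-library sketch below enumerates those combinations to show how quickly the work grows; the fold count of 3 is an assumption, since the default depends on the scikit-learn version.

```python
from itertools import product

parameters = {
    "features__text_pipeline__tfidf__use_idf": (True, False),
    "clf__estimator__n_estimators": [20, 30],
}

# Cartesian product of all value lists, one dict per candidate setting.
names = list(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_folds = 3  # assumption: depends on the cv argument / library default
total_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates,", total_fits, "cross-validation fits")
```

Adding one more parameter with two values would double the number of fits, so it is worth keeping the grid small for slow estimators.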

In [11]:
model2 = model_pipeline2()
model2.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'features__text_pipeline__tfidf__use_idf': (True, False), 'clf__estimator__n_estimators': [20, 30]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [16]:

# Predict categories for the training messages with the tuned model
y_pred = model2.predict(X_train)

multioutput_classification_report(y_train, y_pred)


related:
	Accuracy: 0.9973	Precision: 0.9973	Recall: 0.9973	F1_score: 0.9973
request:
	Accuracy: 0.9971	Precision: 0.9971	Recall: 0.9971	F1_score: 0.9971
offer:
	Accuracy: 0.9995	Precision: 0.9995	Recall: 0.9995	F1_score: 0.9995
aid_related:
	Accuracy: 0.9978	Precision: 0.9978	Recall: 0.9978	F1_score: 0.9978
medical_help:
	Accuracy: 0.9978	Precision: 0.9979	Recall: 0.9978	F1_score: 0.9978
medical_products:
	Accuracy: 0.9978	Precision: 0.9979	Recall: 0.9978	F1_score: 0.9978
search_and_rescue:
	Accuracy: 0.9979	Precision: 0.9979	Recall: 0.9979	F1_score: 0.9979
security:
	Accuracy: 0.9984	Precision: 0.9984	Recall: 0.9984	F1_score: 0.9983
military:
	Accuracy: 0.9994	Precision: 0.9994	Recall: 0.9994	F1_score: 0.9994
water:
	Accuracy: 0.9993	Precision: 0.9993	Recall: 0.9993	F1_score: 0.9993
food:
	Accuracy: 0.9993	Precision: 0.9993	Recall: 0.9993	F1_score: 0.9993
shelter:
	Accuracy: 0.9987	Precision: 0.9987	Recall: 0.9987	F1_score: 0.9987
clothing:
	Accuracy: 0.9994	Precision: 0.9994	Recall:

In [17]:

# Predict categories for the held-out test messages with the tuned model
y_pred = model2.predict(X_test)

multioutput_classification_report(y_test, y_pred)

related:
	Accuracy: 0.8177	Precision: 0.8041	Recall: 0.8177	F1_score: 0.8008
request:
	Accuracy: 0.8889	Precision: 0.8827	Recall: 0.8889	F1_score: 0.8759
offer:
	Accuracy: 0.9960	Precision: 0.9920	Recall: 0.9960	F1_score: 0.9940
aid_related:
	Accuracy: 0.7733	Precision: 0.7720	Recall: 0.7733	F1_score: 0.7709
medical_help:
	Accuracy: 0.9258	Precision: 0.9059	Recall: 0.9258	F1_score: 0.9010
medical_products:
	Accuracy: 0.9521	Precision: 0.9415	Recall: 0.9521	F1_score: 0.9340
search_and_rescue:
	Accuracy: 0.9740	Precision: 0.9645	Recall: 0.9740	F1_score: 0.9629
security:
	Accuracy: 0.9811	Precision: 0.9679	Recall: 0.9811	F1_score: 0.9723
military:
	Accuracy: 0.9648	Precision: 0.9515	Recall: 0.9648	F1_score: 0.9518
water:
	Accuracy: 0.9582	Precision: 0.9546	Recall: 0.9582	F1_score: 0.9506
food:
	Accuracy: 0.9421	Precision: 0.9380	Recall: 0.9421	F1_score: 0.9378
shelter:
	Accuracy: 0.9402	Precision: 0.9355	Recall: 0.9402	F1_score: 0.9300
clothing:
	Accuracy: 0.9860	Precision: 0.9819	Recall:



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
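
One way to add a feature besides TF-IDF is a small custom transformer combined through `FeatureUnion`. The sketch below appends a word-count column to the TF-IDF matrix. The `MessageLength` class, the choice of word count as a feature, and the sample messages are all assumptions for illustration; whether such a feature helps on this dataset is untested here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one numeric column holding each message's word count."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        return np.array([[len(text.split())] for text in X])

# Columns from each transformer are stacked side by side, in list order.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLength()),
])

matrix = features.fit_transform(["we need water", "shelter"])
print(matrix.shape)
print(matrix.toarray()[:, -1])
```

The combined `FeatureUnion` can replace the `features` step of the pipeline, with the classifier step left unchanged.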

In [20]:
from sklearn.tree import DecisionTreeClassifier

# Builds a pipeline using the AdaBoost classifier
pipeline_ada = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(
        AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1, class_weight='balanced'))
    ))
])

parameters_ada = {
    'clf__estimator__learning_rate': [0.1, 0.3],
    'clf__estimator__n_estimators': [200, 400]
}

cv_ada = GridSearchCV(estimator=pipeline_ada, param_grid=parameters_ada, cv=3, scoring='f1_weighted', verbose=3)

In [21]:
# Runs the grid search: fits the AdaBoost pipeline for every parameter combination
# using 3-fold cross-validation, then refits the best combination on the full training set
cv_ada.fit(X_train, y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200, score=0.6208885536962466, total= 4.9min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  5.3min remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200, score=0.6178235428039128, total= 4.9min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed: 10.6min remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200, score=0.6189139892051349, total= 4.9min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=400 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=400, score=0.6311540140277182, total= 9.5min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=400 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=400, score=0.6295619341926636, total= 9.5min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=400 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=400, score=0.6285521138803205, total= 9.5min
[CV] clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=200 
[CV]  clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=200, score=0.6343908165005325, total= 4.9min
[CV] clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=200 
[CV]  clf__estimator__learning_rate=0.3, clf__estimator

[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 92.9min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...er='best'),
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__learning_rate': [0.1, 0.3], 'clf__estimator__n_estimators': [200, 400]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='f1_weighted', verbose=3)

In [22]:
# Displays the parameter combination with the best cross-validated score for the AdaBoost model
cv_ada.best_params_

{'clf__estimator__learning_rate': 0.3, 'clf__estimator__n_estimators': 400}

In [23]:
y_pred = cv_ada.predict(X_train)

multioutput_classification_report(y_train, y_pred)

related:
	Accuracy: 0.7748	Precision: 0.8524	Recall: 0.7748	F1_score: 0.7909
request:
	Accuracy: 0.8796	Precision: 0.8992	Recall: 0.8796	F1_score: 0.8860
offer:
	Accuracy: 0.9908	Precision: 0.9969	Recall: 0.9908	F1_score: 0.9930
aid_related:
	Accuracy: 0.7987	Precision: 0.7978	Recall: 0.7987	F1_score: 0.7979
medical_help:
	Accuracy: 0.8998	Precision: 0.9370	Recall: 0.8998	F1_score: 0.9127
medical_products:
	Accuracy: 0.9140	Precision: 0.9603	Recall: 0.9140	F1_score: 0.9303
search_and_rescue:
	Accuracy: 0.9136	Precision: 0.9756	Recall: 0.9136	F1_score: 0.9372
security:
	Accuracy: 0.9244	Precision: 0.9850	Recall: 0.9244	F1_score: 0.9484
military:
	Accuracy: 0.9705	Precision: 0.9829	Recall: 0.9705	F1_score: 0.9745
water:
	Accuracy: 0.9523	Precision: 0.9674	Recall: 0.9523	F1_score: 0.9571
food:
	Accuracy: 0.9580	Precision: 0.9620	Recall: 0.9580	F1_score: 0.9594
shelter:
	Accuracy: 0.9389	Precision: 0.9536	Recall: 0.9389	F1_score: 0.9437
clothing:
	Accuracy: 0.9881	Precision: 0.9932	Recall:

In [24]:
y_pred = cv_ada.predict(X_test)

multioutput_classification_report(y_test, y_pred)

related:
	Accuracy: 0.7458	Precision: 0.8242	Recall: 0.7458	F1_score: 0.7634
request:
	Accuracy: 0.8600	Precision: 0.8776	Recall: 0.8600	F1_score: 0.8665
offer:
	Accuracy: 0.9811	Precision: 0.9925	Recall: 0.9811	F1_score: 0.9867
aid_related:
	Accuracy: 0.7827	Precision: 0.7814	Recall: 0.7827	F1_score: 0.7814
medical_help:
	Accuracy: 0.8663	Precision: 0.9145	Recall: 0.8663	F1_score: 0.8851
medical_products:
	Accuracy: 0.8980	Precision: 0.9423	Recall: 0.8980	F1_score: 0.9158
search_and_rescue:
	Accuracy: 0.8892	Precision: 0.9624	Recall: 0.8892	F1_score: 0.9208
security:
	Accuracy: 0.8941	Precision: 0.9686	Recall: 0.8941	F1_score: 0.9280
military:
	Accuracy: 0.9553	Precision: 0.9673	Recall: 0.9553	F1_score: 0.9601
water:
	Accuracy: 0.9496	Precision: 0.9654	Recall: 0.9496	F1_score: 0.9549
food:
	Accuracy: 0.9462	Precision: 0.9517	Recall: 0.9462	F1_score: 0.9482
shelter:
	Accuracy: 0.9296	Precision: 0.9419	Recall: 0.9296	F1_score: 0.9342
clothing:
	Accuracy: 0.9783	Precision: 0.9859	Recall:

### 9. Export your model as a pickle file

In [25]:
with open('adaboost_cv.pkl', 'wb') as file:
    pickle.dump(cv_ada, file)
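
Loading the model back mirrors the export. The sketch below shows the round trip with a plain dictionary standing in for the fitted model, so it runs without retraining; the stand-in object and the temporary file path are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted grid search object.
stand_in_model = {
    "name": "hypothetical stand-in",
    "best_params": {"n_estimators": 400, "learning_rate": 0.3},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Export, as in the cell above.
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Import: open in binary read mode and unpickle.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == stand_in_model)
```

With the real model, keep in mind that pickle stores functions by reference rather than by value, so the `tokenize` function must be importable under the same module name in whatever script loads the file.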

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.