# ML Pipeline Preparation

### 1. Import libraries and load data from database
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [82]:
# import libraries
import nltk
nltk.download(['averaged_perceptron_tagger', 'wordnet'])
from nltk import pos_tag
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

import numpy as np
import pandas as pd
import pickle
import re

from sqlalchemy import create_engine

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/faustina/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/faustina/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
# load data from database
engine = create_engine('sqlite:///web_app/data/DisasterResponse.db')
df = pd.read_sql_table('DisasterResponseData', engine)
X = df['message']
Y = df[df.columns[-36:]]

### 2. Write a tokenization function to process text data

In [91]:
def tokenize(text):
    """
    Transforms a text into clean tokens, where every token is a word converted to lower case,
    passed to a part-of-speech tagger and lemmatized accordingly.
    Words recognized as stopwords are omitted.
    
    Input:
        text (str)
        
    Output:
        clean_tokens (list): list of clean tokens (words converted to lower case and lemmatized)
        
    """
    
    tokenizer = RegexpTokenizer(r'\w+')
    lemmatizer = WordNetLemmatizer()
    # build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    # first letter of the Penn Treebank tag -> WordNet POS code
    # (adjectives are tagged 'JJ*', so 'J' maps to 'a')
    wordnet_pos = {'J': 'a', 'R': 'r', 'N': 'n', 'V': 'v'}

    tokens = tokenizer.tokenize(text.lower())
    
    clean_tokens = []
    
    for word, tag in pos_tag(tokens):
        pos = wordnet_pos.get(tag[0])
        clean_token = lemmatizer.lemmatize(word, pos=pos) if pos else word
            
        if clean_token not in stop_words:
            clean_tokens.append(clean_token)
        
    return clean_tokens
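
The lemmatization step depends on translating the Penn Treebank tags returned by `pos_tag` into the single-letter part-of-speech codes that `WordNetLemmatizer.lemmatize` accepts. Adjective tags start with `J` (`JJ`, `JJR`, `JJS`), not `A`. A minimal sketch of that translation on its own, assuming a hypothetical helper `to_wordnet_pos` that is not part of the notebook:

```python
# Hypothetical helper (not part of the notebook): map a Penn Treebank tag,
# as produced by nltk.pos_tag, to a WordNet POS code.
PENN_TO_WORDNET = {
    'J': 'a',  # JJ, JJR, JJS -> adjective
    'R': 'r',  # RB, RBR, RBS -> adverb
    'N': 'n',  # NN, NNS, NNP -> noun
    'V': 'v',  # VB, VBD, VBG -> verb
}

def to_wordnet_pos(penn_tag):
    """Return 'a', 'r', 'n' or 'v', or None when the tag has no WordNet equivalent."""
    if not penn_tag:
        return None
    # Matching on the first letter is coarse: e.g. 'RP' (particle) also maps to 'r'.
    return PENN_TO_WORDNET.get(penn_tag[0])

print(to_wordnet_pos('JJ'))
print(to_wordnet_pos('VBD'))
print(to_wordnet_pos('DT'))
```

Tags without a WordNet counterpart (determiners, prepositions, numbers) return `None`, which corresponds to leaving the word unlemmatized.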

### 3. Build a machine learning pipeline
This machine learning pipeline takes the `message` column as input and outputs classification results for the 36 categories in the dataset.

In [94]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline in batches

In [194]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

In [199]:
batch_size = 100
n_train = X_train.shape[0]

# NOTE: Pipeline.fit() refits every step from scratch on each call, so the
# batches do not accumulate; after the loop the pipeline reflects only the
# last slice it was fitted on.
for start in range(0, n_train, batch_size):
    end = start + batch_size
    pipeline.fit(X_train[start : end], Y_train[start : end])
    # `end` is not capped at n_train, so the final percentage can exceed 100
    print('\n Trained {} examples. {}% completed.'.format(end, round(end / n_train * 100, 2)))


 Trained 100 examples. 0.51% completed.

 Trained 200 examples. 1.02% completed.

 Trained 300 examples. 1.53% completed.

 Trained 400 examples. 2.04% completed.

 Trained 500 examples. 2.54% completed.

 Trained 600 examples. 3.05% completed.

 Trained 700 examples. 3.56% completed.

 Trained 800 examples. 4.07% completed.

 Trained 900 examples. 4.58% completed.

 Trained 1000 examples. 5.09% completed.

 Trained 1100 examples. 5.6% completed.

 Trained 1200 examples. 6.11% completed.

 Trained 1300 examples. 6.61% completed.

 Trained 1400 examples. 7.12% completed.

 Trained 1500 examples. 7.63% completed.

 Trained 1600 examples. 8.14% completed.

 Trained 1700 examples. 8.65% completed.

 Trained 1800 examples. 9.16% completed.

 Trained 1900 examples. 9.67% completed.

 Trained 2000 examples. 10.18% completed.

 Trained 2100 examples. 10.68% completed.

 Trained 2200 examples. 11.19% completed.

 Trained 2300 examples. 11.7% completed.

 Trained 2400 examples. 12.21% completed.


 Trained 19100 examples. 97.18% completed.

 Trained 19200 examples. 97.69% completed.

 Trained 19300 examples. 98.19% completed.

 Trained 19400 examples. 98.7% completed.

 Trained 19500 examples. 99.21% completed.

 Trained 19600 examples. 99.72% completed.

 Trained 19700 examples. 100.23% completed.

 19600
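
The log ends above 100% because the end index of each batch is computed as start plus `batch_size` without being capped at the number of training examples. A minimal sketch of the index arithmetic with the last batch clamped, using a hypothetical helper `batch_bounds` and a made-up sample count:

```python
def batch_bounds(n_samples, batch_size):
    """Yield (start, end) index pairs that cover range(n_samples) exactly once."""
    for start in range(0, n_samples, batch_size):
        # clamp the last batch so `end` never runs past the data
        yield start, min(start + batch_size, n_samples)

# Hypothetical sample count, chosen only for illustration
n_samples = 250

for start, end in batch_bounds(n_samples, 100):
    print('Trained {} examples. {}% completed.'.format(end, round(end / n_samples * 100, 2)))
```

Slicing like this only pays off with estimators that learn incrementally (for example through a `partial_fit` method). By default, calling `fit` on a scikit-learn `Pipeline` discards the previous fit, and `RandomForestClassifier` has no `partial_fit`, so fitting once on the full training set is the usual choice here.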


### 5. Test model
Report the precision, recall, and F1 score for each output category of the dataset, plus the overall accuracy score.

In [192]:
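def report_classification(y_test, y_pred):
    """
    Prints the weighted precision, recall and F1 score for every output category,
    followed by the overall accuracy.
    
    Input:
        y_test (DataFrame): true labels, one column per category
        y_pred (array): predicted labels, shape (n_samples, n_categories)
        
    Output:
        accuracy (float): share of individual labels predicted correctly
        
    Note: all scores are fractions between 0 and 1, although the format
    strings append a '%' sign.
    """
    
    avg = 'weighted'
    
    for idx, col in enumerate(y_test):
        set_y_pair = (y_test[col], y_pred[:, idx])
        rep_col = "{}\n\tPrecision: {:.2f}%\n\tRecall: {:.2f}%\n\tF1 Score: {:.2f}%\n".format(
            col,
            precision_score(*set_y_pair, average=avg),
            recall_score(*set_y_pair, average=avg),
            f1_score(*set_y_pair, average=avg))
        print(rep_col)
        
    # element-wise accuracy over all samples and categories
    accuracy = np.mean(y_test.values == y_pred)
    print('Accuracy Score: {:.2f}%'.format(accuracy))

    return accuracy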
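
To make the reported numbers easier to read, here is a small worked example of how precision, recall, F1 and accuracy are computed for a single binary category. It is a simplified sketch with made-up labels and a hypothetical helper `binary_scores`: it scores only the positive class, whereas the notebook uses `average='weighted'`, which averages the per-class scores weighted by class frequency.

```python
def binary_scores(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1) of one category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration: 2 true positives, 0 false positives, 2 false negatives
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = binary_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print('Precision: {:.2f}  Recall: {:.2f}  F1: {:.2f}  Accuracy: {:.2f}'.format(
    precision, recall, f1, accuracy))
```

The example also shows why accuracy alone can flatter a model on rare categories: half of the positive cases are missed, yet accuracy is still 0.75.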

In [195]:
Y_pred = pipeline.predict(X_test)

report_classification(Y_test, Y_pred)

related
	Precision: 0.64%
	Recall: 0.76%
	F1 Score: 0.67%

request
	Precision: 0.72%
	Recall: 0.76%
	F1 Score: 0.74%

offer
	Precision: 0.99%
	Recall: 0.99%
	F1 Score: 0.99%

aid_related
	Precision: 0.50%
	Recall: 0.55%
	F1 Score: 0.50%

medical_help
	Precision: 0.86%
	Recall: 0.88%
	F1 Score: 0.87%

medical_products
	Precision: 0.91%
	Recall: 0.90%
	F1 Score: 0.91%

search_and_rescue
	Precision: 0.95%
	Recall: 0.97%
	F1 Score: 0.96%



security
	Precision: 0.96%
	Recall: 0.96%
	F1 Score: 0.96%

military
	Precision: 0.93%
	Recall: 0.95%
	F1 Score: 0.94%

child_alone
	Precision: 1.00%
	Recall: 1.00%
	F1 Score: 1.00%

water
	Precision: 0.88%
	Recall: 0.91%
	F1 Score: 0.89%

food
	Precision: 0.80%
	Recall: 0.88%
	F1 Score: 0.83%

shelter
	Precision: 0.85%
	Recall: 0.88%
	F1 Score: 0.86%

clothing
	Precision: 0.97%
	Recall: 0.97%
	F1 Score: 0.97%

money
	Precision: 0.96%
	Recall: 0.98%
	F1 Score: 0.97%

missing_people
	Precision: 0.98%
	Recall: 0.98%
	F1 Score: 0.98%

refugees
	Precision: 0.94%
	Recall: 0.97%
	F1 Score: 0.95%

death
	Precision: 0.90%
	Recall: 0.95%
	F1 Score: 0.93%

other_aid
	Precision: 0.78%
	Recall: 0.86%
	F1 Score: 0.81%

infrastructure_related
	Precision: 0.88%
	Recall: 0.90%
	F1 Score: 0.89%

transport
	Precision: 0.91%
	Recall: 0.94%
	F1 Score: 0.92%

buildings
	Precision: 0.91%
	Recall: 0.94%
	F1 Score: 0.92%

electricity
	Precision: 0.96%
	Recall: 0.98%
	F1 Score: 0.97%

tools
	Precision: 0.99%
	

0.8985000339167006

### 6. Improve model
Use grid search to find better parameters. 

In [196]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x7fbabc8fd320>,
                   vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                          ccp_alpha=0.0,
                                                          class_weight=None,
                                                          criterion='gini',
                                                          max_depth=No

In [None]:
parameters = {}

cv = GridSearchCV(pipeline, parameters)
cv
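
Grid search tries every combination of the candidate values, so the size of the grid determines the training cost. The sketch below shows how a grid expands, using only the standard library. The parameter names follow the pipeline's step names (`vect`, `tfidf`, `clf`); the candidate values and the helper `expand_grid` are assumptions for illustration, not the settings used in this notebook.

```python
from itertools import product

# Hypothetical grid: keys use the <step>__<parameter> naming of the pipeline,
# values are made up for illustration.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'clf__estimator__n_estimators': [50, 100],
}

def expand_grid(grid):
    """List every parameter combination an exhaustive search would try."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

candidates = expand_grid(parameters)
print(len(candidates))  # 2 * 2 * 2 = 8 combinations
print(candidates[0])
```

Each combination is fitted once per cross-validation fold, so even a small grid multiplies the training time considerably.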

### 7. Test model
Show the precision, recall and overall accuracy of the tuned model. 

In [None]:
cv.fit(X_train, Y_train)

In [None]:
y_pred = cv.predict(X_test)

report_classification(Y_test, y_pred)

In [None]:
cv.best_params_

### 8. Improve model


### 9. Export your model as a pickle file

In [None]:
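# assumption: export the fitted `pipeline`; use `cv.best_estimator_` instead to export the tuned model
with open('classifier.pkl', 'wb') as f:
    pickle.dump(pipeline, f)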
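
A quick way to check an export is to load the file back and compare it with the original object. The sketch below does this round trip with a plain dictionary standing in for the fitted model and a temporary directory instead of the project folder; both are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in object: in the notebook this would be the fitted model
model = {'name': 'stand-in for a fitted model', 'n_outputs': 36}

# Write to a temporary directory so the example leaves the project untouched
path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')

with open(path, 'wb') as f:
    pickle.dump(model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle stores functions by reference rather than by value, so a pipeline that uses the custom `tokenize` function can only be loaded where that function is importable under the same module path.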

### 10. Use this notebook to complete `train_classifier.py`

In [None]:
%%writefile web_app/models/train_classifier.py