# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
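
As a minimal sketch of the three steps above, the following builds a throwaway in-memory SQLite database, reads it back with `read_sql_table`, and splits it into a feature series and a target frame with a label-based column slice. The table name `toy_messages` and its rows are invented for illustration and are not taken from the project data.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical toy data standing in for the real messages table.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["need water", "road blocked", "all fine here"],
    "related": [1, 1, 0],
    "request": [1, 0, 0],
    "direct_report": [1, 0, 0],
})

# An in-memory SQLite database; nothing is written to disk.
engine = create_engine("sqlite://")
toy.to_sql("toy_messages", engine, index=False)

# Read the whole table back into a DataFrame.
df = pd.read_sql_table("toy_messages", engine)

# Features: the raw text. Targets: every column from 'related' through 'direct_report'.
X = df["message"]
Y = df.loc[:, "related":"direct_report"]

print(X.shape, Y.shape)
```

Label-based slicing with `.loc` includes both endpoints, so the target frame depends on the column order stored in the table.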

In [1]:
# import libraries
import numpy as np
import pandas as pd
import pickle
import re
import nltk
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer,TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neural_network import MLPClassifier

In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('disaster_messages', engine)
X = df['message']
Y = df.loc[:, 'related':'direct_report']
X.head(5)

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [3]:
nltk.download(['punkt', 'wordnet'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

### 2. Write a tokenization function to process your text data

In [4]:
def tokenize(text):
    """
    Normalize, tokenize, lemmatize, and stem a message.

    Steps:
    1. Lowercase the text and replace non-alphanumeric characters with spaces
    2. Tokenize the text into words
    3. Lemmatize each token (as a noun, then as a verb)
    4. Stem each token and strip surrounding whitespace

    Returns a list of cleaned tokens.
    """
    # replace all non-alphabets / non-numbers with blank space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize words
    tokens = word_tokenize(text)
    
    # instantiate lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # instantiate stemmer
    stemmer = PorterStemmer()
    
    clean_tokens = []
    for tok in tokens:
        # lemmatize token using noun as part of speech
        clean_tok = lemmatizer.lemmatize(tok)
        # lemmatize token using verb as part of speech
        clean_tok = lemmatizer.lemmatize(clean_tok, pos='v')
        # stem token
        clean_tok = stemmer.stem(clean_tok)
        # strip whitespace and append clean token to array
        clean_tokens.append(clean_tok.strip())
        
    return clean_tokens

In [5]:
text = X[0]
print(text)
tokenize(text)

Weather update - a cold front from Cuba that could pass over Haiti


['weather',
 'updat',
 'a',
 'cold',
 'front',
 'from',
 'cuba',
 'that',
 'could',
 'pa',
 'over',
 'haiti']
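
To see what the regular-expression step contributes on its own, this sketch applies the same pattern as `tokenize` and then splits on whitespace. The helper name `normalize_and_split` is hypothetical, `str.split` is a simplified stand-in for `word_tokenize`, and lemmatization and stemming are left out on purpose.

```python
import re

def normalize_and_split(text):
    """Lowercase, replace non-alphanumerics with spaces, split on whitespace."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return cleaned.split()

tokens = normalize_and_split("Weather update - a cold front from Cuba!")
print(tokens)
```

The tokens here stay unstemmed, so a stem such as `updat` in the output above comes from the later Porter stemming step, not from this normalization.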

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the 36 category columns in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [6]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=42),n_jobs=4))
    ])
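
The shape of the pipeline's input and output may be easier to see on a toy problem. The sketch below uses four invented messages with two invented binary labels each and the default `CountVectorizer` tokenizer rather than the notebook's `tokenize`, so it only illustrates the structure, not the project's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented messages; each row of labels holds two invented binary targets.
messages = ["need water now", "water and food please",
            "storm is coming", "heavy storm tonight"]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),       # default tokenizer, not tokenize()
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])

toy_pipeline.fit(messages, labels)
predictions = toy_pipeline.predict(["we need water"])

# One row per input message, one column per target label.
print(predictions.shape)
```

`MultiOutputClassifier` fits one copy of the wrapped estimator per target column, which is why the prediction array has as many columns as the label matrix.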


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
# train classifier
pipeline.fit(X_train, Y_train)
# predict on test data
Y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [8]:
# Get names of all categories
category_names = Y_test.columns.tolist()

Y_pred_df = pd.DataFrame(Y_pred, columns = category_names)
Y_pred_df.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
for i in range(36):
    print(category_names[i],
          '\n',
          classification_report(Y_test.iloc[:,i], Y_pred_df.iloc[:,i]))

related 
              precision    recall  f1-score   support

          0       0.63      0.37      0.46      1266
          1       0.82      0.93      0.87      3938
          2       0.62      0.45      0.52        40

avg / total       0.77      0.79      0.77      5244

request 
              precision    recall  f1-score   support

          0       0.89      0.98      0.93      4349
          1       0.83      0.39      0.53       895

avg / total       0.88      0.88      0.86      5244

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      5218
          1       0.00      0.00      0.00        26

avg / total       0.99      1.00      0.99      5244

aid_related 
              precision    recall  f1-score   support

          0       0.74      0.88      0.80      3113
          1       0.75      0.55      0.64      2131

avg / total       0.75      0.74      0.74      5244

medical_help 
              precision    reca
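
The `offer` report above shows why accuracy alone can mislead on rare categories. The following sketch recomputes the metrics by hand from confusion counts. The counts assume the classifier predicted 0 for every test message, which is consistent with the reported supports (5218 and 26) and the zero recall for class 1, but it is an inference rather than something the notebook prints. The helper `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Report 0.0 when a denominator is zero (no predicted or no true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Assumed counts: all 5244 messages predicted as 0, of which 26 are truly 1.
accuracy, precision, recall = binary_metrics(tp=0, fp=0, fn=26, tn=5218)
print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f}")
```

Under that assumption the model is right on more than 99% of messages while finding none of the positive cases, so per-class recall is the more informative number for such categories.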



### 6. Improve your model
Use grid search to find better parameters. 

In [10]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fb391f7e400>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=42, v

In [11]:
pipeline = Pipeline([('cvect', CountVectorizer(tokenizer = tokenize)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultiOutputClassifier(RandomForestClassifier()))
                    ])

parameters = {'clf__estimator__n_estimators': [25, 50],
              'clf__estimator__min_samples_split': [2,3]}
     
cv = GridSearchCV(pipeline, param_grid=parameters, verbose=1, n_jobs=3)

In [12]:
cv.fit(X_train, Y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=3)]: Done  12 out of  12 | elapsed: 40.3min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('cvect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=3,
       param_grid={'clf__estimator__n_estimators': [25, 50], 'clf__estimator__min_samples_split': [2, 3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
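
The log line "Fitting 3 folds for each of 4 candidates, totalling 12 fits" follows directly from the size of the grid. As a rough sketch using only the standard library (scikit-learn builds its candidate list internally, so this is an illustration rather than its implementation), the candidates can be enumerated like this:

```python
from itertools import product

parameters = {
    "clf__estimator__n_estimators": [25, 50],
    "clf__estimator__min_samples_split": [2, 3],
}

# Every combination of one value per parameter is one candidate.
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*parameters.values())]

n_folds = 3  # matches "Fitting 3 folds" in the log above
total_fits = len(candidates) * n_folds

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates,", total_fits, "fits")
```

Each key is a path through the pipeline: `clf` names the pipeline step, `estimator` is the classifier wrapped by `MultiOutputClassifier`, and the last part is that classifier's own parameter. With `refit=True`, the best candidate is fitted once more on the full training set after the cross-validation fits.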

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [13]:
cv_pred = cv.predict(X_test)
cv_pred_df = pd.DataFrame(cv_pred, columns = category_names)
cv_pred_df.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
overall_accuracy_cv = (cv_pred == Y_test).mean().mean()
overall_accuracy_normal = (Y_pred == Y_test).mean().mean()
print(overall_accuracy_normal)
print(overall_accuracy_cv)

0.942818247309
0.946478515128
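
The expression `(pred == Y_test).mean().mean()` averages the per-column accuracies, which amounts to the fraction of individual label cells predicted correctly. The sketch below reproduces that calculation on invented toy values with plain lists and contrasts it with the stricter exact-match accuracy, where a message counts only if all of its labels are right.

```python
# Toy ground truth and predictions: 3 messages, 2 labels each (invented values).
y_true = [[1, 0], [0, 1], [1, 1]]
y_pred = [[1, 0], [0, 0], [1, 1]]

n_rows, n_cols = len(y_true), len(y_true[0])

# Accuracy of each label column, then the mean of those column accuracies.
column_accuracy = [
    sum(y_true[r][c] == y_pred[r][c] for r in range(n_rows)) / n_rows
    for c in range(n_cols)
]
mean_label_accuracy = sum(column_accuracy) / n_cols

# Stricter alternative: a row counts only if every label in it is correct.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / n_rows

print(mean_label_accuracy, exact_match)
```

Because many categories are rarely positive, the cell-wise figure can stay high even when rare labels are missed, so it is best read alongside the per-category reports.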


In [15]:
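# NOTE: Y_pred_df holds the baseline predictions from In [8]; pass cv_pred_df
# instead to report on the tuned model.
for i in range(36):
    print(category_names[i],
          '\n',
          classification_report(Y_test.iloc[:,i], Y_pred_df.iloc[:,i]))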

related 
              precision    recall  f1-score   support

          0       0.63      0.37      0.46      1266
          1       0.82      0.93      0.87      3938
          2       0.62      0.45      0.52        40

avg / total       0.77      0.79      0.77      5244

request 
              precision    recall  f1-score   support

          0       0.89      0.98      0.93      4349
          1       0.83      0.39      0.53       895

avg / total       0.88      0.88      0.86      5244

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      5218
          1       0.00      0.00      0.00        26

avg / total       0.99      1.00      0.99      5244

aid_related 
              precision    recall  f1-score   support

          0       0.74      0.88      0.80      3113
          1       0.75      0.55      0.64      2131

avg / total       0.75      0.74      0.74      5244

medical_help 
              precision    reca



In [16]:
# NOTE: this loop scores Y_pred_df (the baseline predictions from In [8]);
# use cv_pred_df to score the tuned model.
for i in range(36):
    category = category_names[i]
    accuracy = accuracy_score(Y_test.iloc[:,i], Y_pred_df.iloc[:,i])
    # average='micro' pools the counts of all classes in the column
    precision = precision_score(Y_test.iloc[:,i], Y_pred_df.iloc[:,i], average='micro')
    recall = recall_score(Y_test.iloc[:,i], Y_pred_df.iloc[:,i], average='micro')
    f1 = f1_score(Y_test.iloc[:,i], Y_pred_df.iloc[:,i], average='micro')
    print(category)
    print("\t Accuracy: %.4f \t Precision: %.4f \t Recall: %.4f \t F1-Score: %.4f \n" %
              (accuracy, precision, recall, f1))

related
	 Accuracy: 0.7899 	 Precision: 0.7899 	 Recall: 0.7899 	 F1-Score: 0.7899 

request
	 Accuracy: 0.8822 	 Precision: 0.8822 	 Recall: 0.8822 	 F1-Score: 0.8822 

offer
	 Accuracy: 0.9950 	 Precision: 0.9950 	 Recall: 0.9950 	 F1-Score: 0.9950 

aid_related
	 Accuracy: 0.7449 	 Precision: 0.7449 	 Recall: 0.7449 	 F1-Score: 0.7449 

medical_help
	 Accuracy: 0.9211 	 Precision: 0.9211 	 Recall: 0.9211 	 F1-Score: 0.9211 

medical_products
	 Accuracy: 0.9521 	 Precision: 0.9521 	 Recall: 0.9521 	 F1-Score: 0.9521 

search_and_rescue
	 Accuracy: 0.9760 	 Precision: 0.9760 	 Recall: 0.9760 	 F1-Score: 0.9760 

security
	 Accuracy: 0.9826 	 Precision: 0.9826 	 Recall: 0.9826 	 F1-Score: 0.9826 

military
	 Accuracy: 0.9697 	 Precision: 0.9697 	 Recall: 0.9697 	 F1-Score: 0.9697 

child_alone
	 Accuracy: 1.0000 	 Precision: 1.0000 	 Recall: 1.0000 	 F1-Score: 1.0000 

water
	 Accuracy: 0.9466 	 Precision: 0.9466 	 Recall: 0.9466 	 F1-Score: 0.9466 

food
	 Accuracy: 0.9144 	 Precision
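
Every category above reports the same value four times. That is expected with micro-averaging on a column where each message has exactly one class: each wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so micro precision, micro recall, and accuracy coincide. A small hand-rolled check on invented labels, with the hypothetical helper `micro_scores`:

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision and recall over all classes, plus accuracy."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for cls in classes:
        for t, p in zip(y_true, y_pred):
            tp += (t == cls and p == cls)
            fp += (t != cls and p == cls)
            fn += (t == cls and p != cls)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return precision, recall, accuracy

# Invented labels for one category column.
y_true = [0, 0, 1, 1, 1, 0, 2, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]
precision, recall, accuracy = micro_scores(y_true, y_pred)
print(precision, recall, accuracy)
```

Per-class scores, as printed by `classification_report`, or a macro average would separate the three metrics and expose weak minority classes.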

In [17]:
cv.best_params_

{'clf__estimator__min_samples_split': 2, 'clf__estimator__n_estimators': 50}

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [18]:
improved_pipeline = Pipeline([('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                     ('clf', MultiOutputClassifier(AdaBoostClassifier(random_state=42),n_jobs=4))])
parameters = {
    #'tfidf__max_df': (0.9, 1.0),
    #'tfidf__min_df': (0.01, 1),
    'tfidf__ngram_range': ((1, 1),(1,3)),
    #'tfidf__stop_words': (None, 'english'),
    #'clf__estimator__learning_rate': (0.1,1.0),
    #'clf__estimator__n_estimators': (50, 100)
}
improved_pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...timator=None,
          learning_rate=1.0, n_estimators=50, random_state=42),
           n_jobs=4))])

In [19]:
cv1 = GridSearchCV(improved_pipeline, param_grid=parameters, verbose=2, n_jobs=4)
cv1.fit(X_train, Y_train)
cv_pred_1 = cv1.predict(X_test)
cv_pred_df_1 = pd.DataFrame(cv_pred_1, columns = category_names)
cv_pred_df_1.head()

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] tfidf__ngram_range=(1, 1) .......................................
[CV] tfidf__ngram_range=(1, 1) .......................................
[CV] tfidf__ngram_range=(1, 1) .......................................
[CV] tfidf__ngram_range=(1, 3) .......................................




[CV] ........................ tfidf__ngram_range=(1, 1), total= 9.7min
[CV] tfidf__ngram_range=(1, 3) .......................................
[CV] ........................ tfidf__ngram_range=(1, 1), total= 9.8min
[CV] tfidf__ngram_range=(1, 3) .......................................
[CV] ........................ tfidf__ngram_range=(1, 1), total= 9.8min


[Parallel(n_jobs=4)]: Done   3 out of   6 | elapsed: 11.5min remaining: 11.5min


[CV] ........................ tfidf__ngram_range=(1, 3), total=40.9min




[CV] ........................ tfidf__ngram_range=(1, 3), total=35.8min
[CV] ........................ tfidf__ngram_range=(1, 3), total=36.4min


[Parallel(n_jobs=4)]: Done   6 out of   6 | elapsed: 49.1min finished


Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
overall_accuracy_cv = (cv_pred_1 == Y_test).mean().mean()
print(overall_accuracy_cv)

0.947797482838


**Comparison**

The tuned Random Forest reached an overall accuracy of 94.65%, while AdaBoost with grid search reached 94.78%, a gain of about 0.13 percentage points. Reaching 94.65% with Random Forest already required a grid search over `n_estimators` and `min_samples_split`, which improved on the untuned baseline of 94.28%. Switching to AdaBoost added only a marginal further improvement.
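
To put the marginal difference in concrete terms, the two printed accuracies can be converted into a count of label cells. This is plain arithmetic on numbers shown in the notebook output (5244 test messages, 36 categories); the variable names are just for illustration.

```python
n_messages = 5244      # test-set support shown in the classification reports
n_categories = 36      # number of target columns
n_cells = n_messages * n_categories

accuracy_random_forest = 0.946478515128  # tuned Random Forest (In [14])
accuracy_adaboost = 0.947797482838       # AdaBoost grid search (In [20])

gain = accuracy_adaboost - accuracy_random_forest
extra_correct_cells = gain * n_cells

print(f"gain: {gain * 100:.2f} percentage points")
print(f"about {extra_correct_cells:.0f} more correct labels out of {n_cells}")
```

Overall accuracy says nothing about which categories gained or lost, so the per-category reports remain the better basis for choosing between the two models.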

In [21]:
improved_pred = improved_pipeline.predict(X_test)
improved_pred_df = pd.DataFrame(improved_pred, columns = Y_test.columns)
for column in Y_test.columns:
    print('------------------------------------------------------\n')
    print('Accuracy: ', accuracy_score(Y_test[column], improved_pred_df[column]))
    print('Feature: {}\n'.format(column))
    print(classification_report(Y_test[column],improved_pred_df[column]))

------------------------------------------------------

Accuracy:  0.768497330282
Feature: related

             precision    recall  f1-score   support

          0       0.60      0.23      0.33      1266
          1       0.79      0.95      0.86      3938
          2       0.15      0.07      0.10        40

avg / total       0.74      0.77      0.73      5244

------------------------------------------------------

Accuracy:  0.889206712433
Feature: request

             precision    recall  f1-score   support

          0       0.91      0.97      0.94      4349
          1       0.76      0.51      0.61       895

avg / total       0.88      0.89      0.88      5244

------------------------------------------------------

Accuracy:  0.994660564455
Feature: offer

             precision    recall  f1-score   support

          0       1.00      1.00      1.00      5218
          1       0.33      0.08      0.12        26

avg / total       0.99      0.99      0.99      5244

----

In [24]:
overall_accuracy = (improved_pred == Y_test).mean().mean()
overall_accuracy

0.94779748283752874

In [26]:
cv1.best_params_

{'tfidf__ngram_range': (1, 1)}
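
The grid compared `ngram_range=(1, 1)` with `(1, 3)`. As a rough illustration of what the wider range adds, the hypothetical helper below lists all contiguous n-grams of a token sequence. It mimics the idea only and is not the vectorizer's own code.

```python
def ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams for n in [min_n, max_n], joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["cold", "front", "from", "cuba"]
unigrams = ngrams(tokens, 1, 1)
up_to_trigrams = ngrams(tokens, 1, 3)

print(len(unigrams), unigrams)
print(len(up_to_trigrams), up_to_trigrams)
```

More n-grams per message means a much larger vocabulary, which is a plausible reason the `(1, 3)` fits in the log above took roughly four times as long as the `(1, 1)` fits, although the notebook does not measure this directly.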

### 9. Export your model as a pickle file

In [27]:
filename = 'bestClassifier.pkl'
# use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(cv1, f)
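
A quick round trip shows how a pickled object is written and restored. The sketch uses a plain dictionary as a stand-in for the fitted model and a hypothetical file name inside a temporary directory, so it leaves nothing behind on disk.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object behaves the same way.
model = {"name": "toy-model", "best_params": {"tfidf__ngram_range": (1, 1)}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")  # hypothetical file name

    # The with-block closes the file even if dump() raises.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

When loading the real model, keep in mind that pickle stores functions by reference, so `tokenize` must be defined or importable under the same name in the process that calls `pickle.load`.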

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
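
The template file is not reproduced here, so the following is only a hypothetical skeleton of how such a script might accept the database and model paths from the user. The function names, argument names, and step comments are assumptions, not the template's actual layout.

```python
import argparse

def parse_args(argv=None):
    """Parse the two paths a training script of this kind would need."""
    parser = argparse.ArgumentParser(
        description="Train and export a message classifier.")
    parser.add_argument("database_filepath",
                        help="SQLite database holding the cleaned messages")
    parser.add_argument("model_filepath",
                        help="where to write the pickled model")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Placeholders for the notebook steps above:
    # 1. load X and Y from args.database_filepath
    # 2. build the pipeline and run the grid search
    # 3. print the evaluation reports
    # 4. pickle the fitted model to args.model_filepath
    print(f"database: {args.database_filepath}")
    print(f"model:    {args.model_filepath}")
    return args

# Example call with explicit arguments instead of the real command line.
args = main(["DisasterResponse.db", "bestClassifier.pkl"])
```

Passing `argv` explicitly keeps the functions easy to test; a real script would call `main()` without arguments so that `argparse` reads the command line.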