# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re
import warnings
warnings.filterwarnings("ignore")
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download(['wordnet', 'averaged_perceptron_tagger'])
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
import pickle

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql("SELECT * FROM clean_data", engine)
X = df['message'].values
Y = df[df.columns[4:]]


In [3]:
# Display the value of X:

X

array(['Weather update - a cold front from Cuba that could pass over Haiti',
       'Is the Hurricane over or is it not over',
       'Looking for someone but no name', ...,
       "Proshika, operating in Cox's Bazar municipality and 5 other unions, Ramu and Chokoria, assessment, 5 kg rice, 1,5 kg lentils to 700 families.",
       'Some 2,000 women protesting against the conduct of the elections were teargassed as they tried to converge on the local electoral commission offices in the southern oil city of Port Harcourt.',
       'A radical shift in thinking came about as a result of this meeting, recognizing that HIV/AIDS is at the core of the humanitarian crisis and identifying the crisis itself as a function of the HIV/AIDS pandemic.'], dtype=object)

In [4]:
# Display the value of Y:

Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Display the various columns:

df.columns

### 2. Write a tokenization function to process your text data

In [5]:
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# build the stop word set once instead of reloading it for every token
stop_words = set(stopwords.words('english'))

def tokenize(text):
    
    """
    Takes a Python string object and outputs a list of processed words 
    of the text. All urls in the text are replaced with the str 
    'urlplaceholder' before tokenizing.
    
    INPUT:
        - text - Python str object - raw text data
        
    OUTPUT:
        - clean_tokens - Python list object - list of lowercased, lemmatized 
          words from the input `text` with English stop words removed.
        
    """
    
    # replace each url in the text string with a placeholder; this has to
    # happen before punctuation is stripped, or the regex can no longer match
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    # normalize case and replace non-alphanumeric characters with spaces
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)

    words = word_tokenize(text)
    
    tokens = [ele for ele in words if ele not in stop_words] 
    
    lemmatizer = WordNetLemmatizer()
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
    

In [6]:
# Test the tokenizer on a sample message:

print(tokenize(X[4]))

['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight']


### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [8]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier(n_jobs = -1)))
    ])

In [9]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f7b5ab6b2f0>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
               oob_score=False, random_state=None

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [10]:
# Split data into train and test:

X_train, X_test, y_train, y_test = train_test_split(X, Y)

In [11]:
# Fit model:

pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [12]:
# Predict using test data:

y_pred = pipeline.predict(X_test)

In [13]:
# Check the number of rows and columns:

y_pred.shape, y_test.shape, len(list(Y.columns))

((6554, 36), (6554, 36), 36)

In [14]:
# Check accuracy:

labels = np.unique(y_pred)

accuracy = (y_pred == y_test).mean()

print("Labels:", labels)

print("Accuracy: \n\n", accuracy)

Labels: [0 1]
Accuracy: 

 related                   0.809277
request                   0.895636
offer                     0.995575
aid_related               0.738938
medical_help              0.918523
medical_products          0.952243
search_and_rescue         0.973604
security                  0.981538
military                  0.968416
child_alone               1.000000
water                     0.950565
food                      0.940342
shelter                   0.935002
clothing                  0.985658
money                     0.977724
missing_people            0.989472
refugees                  0.967653
death                     0.959872
other_aid                 0.874123
infrastructure_related    0.936070
transport                 0.956363
buildings                 0.954989
electricity               0.979860
tools                     0.993287
hospitals                 0.990693
shops                     0.995117
aid_centers               0.987946
other_infrastructure      0.

from sklearn.base import BaseEstimator, TransformerMixin
from custom_transformer import StartingVerbExtractor

feature_pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('starting_verb', StartingVerbExtractor())
        ])),

        ('clf', MultiOutputClassifier(RandomForestClassifier(n_jobs = -1)))
    ])


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
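
The per-column iteration described above can be sketched as follows. This is a minimal example on made-up data: `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for `y_test` and `y_pred`, and the `output_dict` argument assumes scikit-learn 0.20 or newer.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical stand-ins for y_test (DataFrame) and y_pred (2-D array)
y_true_demo = pd.DataFrame({
    "related": [1, 0, 1, 1],
    "request": [0, 0, 1, 0],
})
y_pred_demo = np.array([
    [1, 0],
    [0, 0],
    [0, 1],
    [1, 1],
])

# One report per output category, keyed by column name
reports = {}
for i, column in enumerate(y_true_demo.columns):
    y_true_col = y_true_demo[column]
    y_pred_col = y_pred_demo[:, i]
    reports[column] = classification_report(y_true_col, y_pred_col, output_dict=True)
    print(column)
    print(classification_report(y_true_col, y_pred_col))
```

Each report then covers both the 0 and 1 class of a single category, which makes poor recall on rare categories easier to spot than in a single combined table.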

In [15]:
# Check f1 score:

print(classification_report(y_test,y_pred,target_names = df.columns[4:]))


                        precision    recall  f1-score   support

               related       0.85      0.92      0.88      5030
               request       0.79      0.47      0.59      1051
                 offer       0.00      0.00      0.00        29
           aid_related       0.73      0.57      0.64      2673
          medical_help       0.55      0.07      0.12       540
      medical_products       0.68      0.10      0.18       331
     search_and_rescue       0.53      0.05      0.08       174
              security       0.50      0.01      0.02       121
              military       0.68      0.09      0.16       217
           child_alone       0.00      0.00      0.00         0
                 water       0.81      0.29      0.43       418
                  food       0.80      0.60      0.69       712
               shelter       0.79      0.29      0.42       540
              clothing       0.69      0.17      0.28       104
                 money       0.40      

### 6. Improve your model
Use grid search to find better parameters. 

In [16]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__bootstrap', 'clf__estimator__class_weight', 'clf__estimator__criterion', 'clf__estimator__max_depth', 'clf__estimator__max_features', 'clf__estimator__max_leaf_nodes', 'clf__estimator__min_impurity_decrease', 'clf__estimator__min_impurity_split', 'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split', 'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__n_estimators', 'clf__estimator__n_jobs', 'clf__estimator__oob_score', 'clf__estimator__random_state', 'clf__estimator__verbose', 'clf__estimator__

In [17]:
# select parameters for GridSearchCV:

parameters = {
     'clf__estimator__n_estimators': [5]
    }

# The custom scorer `avg_accuracy_cv` has to be defined before it can be
# passed as scoring=avg_accuracy_cv; the default scorer is used here.
cv = GridSearchCV(
    pipeline, 
    param_grid=parameters,
    cv=3,
    verbose=3)
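
The `avg_accuracy_cv` name referenced in the grid search cell has no definition in this notebook. One possible sketch, assuming it is meant to average the per-category accuracies, is shown below. Both function bodies are assumptions, not the original implementation. The `(estimator, X, y)` signature is the one scikit-learn accepts for a callable `scoring` argument.

```python
import numpy as np


def avg_accuracy(y_true, y_pred):
    """Mean of the per-column accuracies of a multi-output prediction."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # accuracy per output column first, then the average over columns
    return float((y_true == y_pred).mean(axis=0).mean())


def avg_accuracy_cv(estimator, X, y):
    """Hypothetical scorer: callable with the (estimator, X, y) signature."""
    return avg_accuracy(y, estimator.predict(X))
```

A higher return value means a better model, which matches what `GridSearchCV` expects from a scorer.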


In [18]:
# Fit model:

model = cv.fit(X_train, y_train)

In [19]:
# Predict using test data:

y_predict = model.predict(X_test)

In [20]:
# Check the number of rows and columns:

y_predict.shape, y_test.shape, len(list(Y.columns))

((6554, 36), (6554, 36), 36)

In [21]:
# Check accuracy of the model:

labels = np.unique(y_predict)

accuracy = (y_predict == y_test).mean()

print("Labels:", labels)

print("Accuracy: \n\n", accuracy)

Labels: [0 1]
Accuracy: 

 related                   0.799817
request                   0.881141
offer                     0.995575
aid_related               0.727495
medical_help              0.919286
medical_products          0.950107
search_and_rescue         0.972536
security                  0.980623
military                  0.966585
child_alone               1.000000
water                     0.948123
food                      0.927678
shelter                   0.933476
clothing                  0.986115
money                     0.975893
missing_people            0.989319
refugees                  0.968264
death                     0.956820
other_aid                 0.865121
infrastructure_related    0.933171
transport                 0.955752
buildings                 0.954226
electricity               0.978944
tools                     0.992981
hospitals                 0.989930
shops                     0.995117
aid_centers               0.987641
other_infrastructure      0.

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [22]:
# check f1 score:

print(classification_report(y_test,y_predict,target_names = df.columns[4:]))


                        precision    recall  f1-score   support

               related       0.84      0.91      0.88      5030
               request       0.68      0.49      0.57      1051
                 offer       0.00      0.00      0.00        29
           aid_related       0.67      0.64      0.66      2673
          medical_help       0.53      0.16      0.25       540
      medical_products       0.53      0.10      0.17       331
     search_and_rescue       0.44      0.13      0.20       174
              security       0.12      0.01      0.02       121
              military       0.48      0.12      0.20       217
           child_alone       0.00      0.00      0.00         0
                 water       0.76      0.27      0.40       418
                  food       0.73      0.52      0.61       712
               shelter       0.70      0.34      0.46       540
              clothing       0.72      0.20      0.32       104
                 money       0.29      

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
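
As an illustration of the second idea, the sketch below adds one extra numeric feature next to TF-IDF using `FeatureUnion`. The `MessageLengthExtractor` class, the toy messages, and the toy labels are all hypothetical; the default `CountVectorizer` tokenizer is used so the example does not depend on NLTK downloads.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: the word count of each message as one feature."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(str(text).split())] for text in X])


length_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
        ])),
        ('message_length', MessageLengthExtractor()),
    ])),
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])

# Toy data: four messages, two output categories
demo_messages = np.array([
    "we need water and food",
    "storm destroyed the bridge",
    "need water after the storm",
    "thank you for the update",
])
demo_labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

length_pipeline.fit(demo_messages, demo_labels)
demo_pred = length_pipeline.predict(demo_messages)
```

Swapping `RandomForestClassifier` for another estimator inside `MultiOutputClassifier` covers the first idea with no other changes to the pipeline.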

### 9. Export your model as a pickle file

In [23]:
filename = 'random_forest_classifier_model.pkl'
# use a context manager so the file handle is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(pipeline, f)
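
To check an export, the pickle file can be loaded back and used for prediction. The sketch below does a round trip with a small stand-in pipeline; `demo_pipeline`, `demo_messages`, and the temporary file path are hypothetical. Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small stand-in model (hypothetical data) so the round trip runs quickly
demo_messages = [
    "we need water",
    "the storm has passed",
    "need food and water",
    "storm damage on roads",
]
demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])
demo_pipeline.fit(demo_messages, [1, 0, 1, 0])

# Write the fitted pipeline to disk, then read it back
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_model.pkl')
with open(demo_path, 'wb') as f:
    pickle.dump(demo_pipeline, f)
with open(demo_path, 'rb') as f:
    restored = pickle.load(f)

print(restored.predict(["please send water"]))
```

A pipeline that uses a custom tokenizer, such as `tokenize` above, needs that function to be importable under the same name when the file is loaded.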

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
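
The template file itself is not shown here, so the following is only a rough sketch of how such a script might read its inputs from the command line. The argument names `database_filepath` and `model_filepath` and the function name `parse_args` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths the script needs (hypothetical argument names)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file.")
    parser.add_argument(
        "database_filepath",
        help="path to the SQLite database, e.g. DisasterResponse.db")
    parser.add_argument(
        "model_filepath",
        help="path for the exported model, e.g. random_forest_classifier_model.pkl")
    return parser.parse_args(argv)


# Example call with explicit arguments; a real script would call parse_args()
# with no arguments so the values come from the command line.
args = parse_args(["DisasterResponse.db", "random_forest_classifier_model.pkl"])
print(args.database_filepath, args.model_filepath)
```

The remaining steps of the script would mirror the notebook: load the data, build and fit the pipeline, report the metrics, and write the pickle file.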