# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [2]:
# import libraries
from sqlalchemy import create_engine
import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import (classification_report, confusion_matrix, precision_score,
                             recall_score, accuracy_score, f1_score, make_scorer)
from sklearn.base import BaseEstimator, TransformerMixin
import nltk

from nltk import word_tokenize

import pickle

In [63]:
def load_data():
    """Load the messages (X), the category columns (y), and the category names."""
    # The data can also be loaded from the SQLite database:
    #engine = create_engine('sqlite:///DisasterResponse_new.db')
    #df = pd.read_sql("SELECT * FROM DisasterResponse_new", engine)
    df = pd.read_csv("DisasterResponse_new.csv")
    X = df.message
    y = df.loc[:, "related":"direct_report"]
    category_names = y.columns
    return X, y, category_names
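
The section heading mentions loading from a database, while the cell above reads a CSV file. As a minimal sketch of the database route, the example below writes a toy table to an in-memory SQLite database and reads it back. The table name `ToyMessages` and its rows are hypothetical. Note that `read_sql_table` requires an SQLAlchemy engine, so this sketch uses `read_sql_query` with the standard-library `sqlite3` module instead.

```python
import sqlite3

import pandas as pd

# Toy stand-in for the real table (hypothetical rows and columns)
toy = pd.DataFrame({
    "message": ["Is the Hurricane over", "Looking for someone"],
    "related": [1, 1],
    "storm": [1, 0],
})

# Write the toy table to an in-memory SQLite database and read it back
conn = sqlite3.connect(":memory:")
toy.to_sql("ToyMessages", conn, index=False)
df = pd.read_sql_query("SELECT * FROM ToyMessages", conn)
conn.close()

# Split into the feature (message text) and the label columns
X = df["message"]
y = df.loc[:, "related":"storm"]
print(X.shape, y.shape)
```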

Use the first five messages as a sample to take a look at the data.

In [64]:
X,y,category_names=load_data()
print(X[:5])
y.head(5)

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object



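| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |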


### 2. Normalize text data

A sequence of functions that strip HTML markup, expand contractions, stem and lemmatize, remove special characters, remove stop words, and remove accents from characters is defined in the notebook `Text_Normalization_Function`. Running that notebook makes the functions available in this one.
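
To give a feel for what such normalization does, here is a minimal sketch using only the standard library. It is not the code from `Text_Normalization_Function`: the function names and the tiny stop-word set are hypothetical, and it covers only accent removal, lowercasing, special-character removal, and stop-word filtering.

```python
import re
import unicodedata

# Tiny illustrative stop-word set (an assumption, not NLTK's list)
TOY_STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def toy_normalize(text):
    """Return one cleaned string: no accents, lowercase, no punctuation, no stop words."""
    text = strip_accents(text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in TOY_STOP_WORDS]
    return " ".join(tokens)

print(toy_normalize("The café in Léogâne is DESTROYED!!"))
```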

In [13]:
%run ./Text_Normalization_Function.ipynb

Processing c:\users\nsun9\appdata\local\pip\cache\wheels\4f\85\2a\67a30aa6cf144eca0c159f337ce5166df2213c4cde9e699cbe\html_parser-0.2-py3-none-any.whl
Installing collected packages: html.parser
Successfully installed html.parser


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nsun9\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\nsun9\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nsun9\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nsun9\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


`normalize_corpus` can be passed to `CountVectorizer` as a custom `preprocessor`. The **preprocessor** parameter takes a callable (default `None`) that overrides the preprocessing stage (`strip_accents` and `lowercase`) while preserving the tokenizing and n-gram generation steps. It is called on one document at a time and should return a single string **(not a series or list)**.

However, if a function is instead used to normalize the whole corpus before it is fed to `CountVectorizer`, that function should return a series or list.
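
A minimal sketch of the preprocessor contract described above: the callable receives one document and returns one string. The function `lowercase_letters_only` is a hypothetical stand-in for `normalize_corpus`.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

def lowercase_letters_only(text):
    """Hypothetical preprocessor: one string in, one string out."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

docs = ["Is the Hurricane over?", "Looking for someone, but no name!"]

# The custom callable replaces the default lowercase/strip_accents stage;
# tokenizing is still done by CountVectorizer itself
toy_vectorizer = CountVectorizer(preprocessor=lowercase_letters_only)
toy_bow = toy_vectorizer.fit_transform(docs)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_bow.toarray())
```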

Use the first five messages as a sample to take a look at the result after `CountVectorizer`.

In [14]:
bow_vectorizer = CountVectorizer(preprocessor=normalize_corpus)
NORM_corpus_train_bow = bow_vectorizer.fit_transform(X[:5])
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
NORM_corpus_train_bow_table = pd.DataFrame(data=NORM_corpus_train_bow.todense(),
                                           columns=bow_vectorizer.get_feature_names())
NORM_corpus_train_bow_table.head()

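| | cold | could | country | croix | cuba | desperately | destroyed | front | functioning | haiti | ... | side | someone | st | supply | today | tonight | un | update | weather | west |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |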


### 3. Add other features besides the TF-IDF

Other characteristics of the text, such as its length, may also affect the results. I defined a custom transformer that counts the number of tokens in each message.

In [27]:
class Text_Length_Extractor(BaseEstimator, TransformerMixin):
    def get_length(self, text):
        length=len(word_tokenize(text))
        return length
    
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_length = pd.Series(X).apply(self.get_length)
        # FeatureUnion stacks the outputs of its transformers side by side, so X_length must be
        # two-dimensional (one row per message). Returning the Series as is raises
        # "ValueError: blocks[0,:] has incompatible row dimensions".
        return pd.DataFrame(X_length)
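
The following sketch shows how a length feature ends up as one extra column next to the TF-IDF features. It is an illustration under assumptions: `WhitespaceLength` is a hypothetical, simplified variant of `Text_Length_Extractor` that splits on whitespace (so no NLTK data is needed) and returns a 2-D NumPy array rather than a DataFrame.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class WhitespaceLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: number of whitespace-separated tokens per document."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        lengths = [len(str(text).split()) for text in X]
        # Shape (n_samples, 1) so it can be stacked next to the TF-IDF matrix
        return np.array(lengths, dtype=float).reshape(-1, 1)

docs = ["Is the Hurricane over", "Looking for someone but no name"]
union = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", WhitespaceLength()),
])
features = union.fit_transform(docs)

# The last column holds the token counts
print(features.shape)
print(features[:, -1].toarray().ravel())
```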

### 4. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [28]:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline',Pipeline([
            ('vect', CountVectorizer(preprocessor=normalize_corpus)),
            ('tfidf', TfidfTransformer())
        ])),
        ('text_length',Text_Length_Extractor())
    ])),
        
    ('clf', MultiOutputClassifier(estimator=RandomForestClassifier(random_state=42)))
    ])
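
As a small illustration of what `MultiOutputClassifier` does, the sketch below fits it on random toy data with three binary targets. The data and the small forest size are assumptions chosen to keep the example fast. It fits one copy of the base estimator per target column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)
X_toy = rng.rand(40, 4)
# Three binary targets, each derived from one feature
Y_toy = np.column_stack([
    (X_toy[:, 0] > 0.5).astype(int),
    (X_toy[:, 1] > 0.5).astype(int),
    (X_toy[:, 2] > 0.5).astype(int),
])

toy_clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=42))
toy_clf.fit(X_toy, Y_toy)
toy_pred = toy_clf.predict(X_toy)

print(toy_pred.shape)            # one prediction column per target
print(len(toy_clf.estimators_))  # one fitted forest per target
```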

### 5. Train pipeline
- Split data into train and test sets
- Train pipeline

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
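
Note that `train_test_split` is called above without `test_size` or `random_state`. A minimal sketch on toy arrays (the data is an assumption) shows the two consequences: by default 25% of the rows go to the test set, and the split only repeats exactly when `random_state` is fixed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100) % 2

# Same seed twice: the two splits are identical
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, random_state=42)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```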

### 6. Test your model
Report the f1 score, precision and recall for each output category of the dataset.

`y_pred` is a NumPy array with shape (6554, 36), so its columns are accessed by integer position, for example `y_pred[:, 0]`.

In [49]:
print(y_pred)
print(y_pred[:,0])

[[1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 ...
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]
 [1 0 0 ... 0 0 0]]
[1 1 1 ... 1 1 1]


`y_test` is a pandas DataFrame. To access its columns by position, use `df.iloc`, which is integer-location based indexing.

In [45]:
y_test.head()

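| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12720 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 10777 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1273 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12457 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 11980 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |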
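
Because the test rows keep their shuffled index labels from the original data, position-based access is the safe way to line them up with the prediction array. A minimal sketch on toy data (the values are assumptions):

```python
import numpy as np
import pandas as pd

# Toy labels with a shuffled index, as produced by train_test_split
y_true_toy = pd.DataFrame(
    {"related": [1, 0, 1], "request": [0, 0, 1]},
    index=[12720, 10777, 1273],
)
y_hat_toy = np.array([[1, 0], [1, 0], [1, 1]])

# Position-based access on both sides: first column of each
first_true = y_true_toy.iloc[:, 0]
first_pred = y_hat_toy[:, 0]

# Compare by position, ignoring the index labels
matches = first_true.to_numpy() == first_pred
print(matches.mean())
```

Here `y_true_toy.iloc[0]` returns the first row, whereas `y_true_toy.loc[0]` would raise a `KeyError` because no row carries the label 0.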


The `average` parameter is required for multiclass/multilabel targets.

Binary: only report results for the class specified by `pos_label` (default is 1).

Macro average: the unweighted mean of the per-label scores. Weighted average: the mean of the per-label scores weighted by support (the number of true instances of each label).

Take the F1 score as an example:

Macro F1 calculates the F1 of each class separately and averages them without weights: (F1class1+F1class2+⋅⋅⋅+F1classN)/N. When the classes are imbalanced, this results in a bigger penalisation when the model does not perform well on the minority classes.

Weighted F1 also calculates the F1 score of each class independently, but when it combines them it uses weights proportional to the number of true labels of each class: F1class1∗W1+F1class2∗W2+⋅⋅⋅+F1classN∗WN. It therefore favours the majority class.
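
The two averages can be worked through by hand on a small imbalanced example. The sketch below is plain Python with made-up labels. It computes the per-class F1 first and then combines the scores both ways.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Eight samples of class 0 and two of class 1 (toy data)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

classes = [0, 1]
f1s = [f1_for_class(y_true, y_pred, c) for c in classes]
supports = [y_true.count(c) for c in classes]

# Macro: plain mean; weighted: mean weighted by the number of true labels
macro_f1 = sum(f1s) / len(f1s)
weighted_f1 = sum(f * s for f, s in zip(f1s, supports)) / sum(supports)

print(f1s, macro_f1, weighted_f1)
```

The minority class has the lower F1 (0.5 against 0.875), so the macro average (0.6875) comes out lower than the weighted average (0.8).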

In [52]:
print(classification_report(y_test.iloc[:,0], y_pred[:,0]))

              precision    recall  f1-score   support

           0       0.70      0.38      0.50      1566
           1       0.83      0.95      0.89      4988

    accuracy                           0.81      6554
   macro avg       0.76      0.67      0.69      6554
weighted avg       0.80      0.81      0.79      6554



In this project, I use the default `average` parameter, `binary`. The recall and precision for some small categories, such as `offer` and `child_alone`, are almost zero: because of the imbalance in the training data, the classifier labels almost every message as 0 in those categories.
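
A toy illustration of this effect, with made-up labels: a classifier that always predicts 0 on a category with 2% positives still reaches 98% accuracy, while its recall is zero. `zero_division=0` is passed so that the undefined precision (no positive predictions) is reported as 0 without a warning.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 98 negatives, 2 positives; the "classifier" predicts 0 for everything
y_true_rare = [0] * 98 + [1] * 2
y_pred_rare = [0] * 100

acc = accuracy_score(y_true_rare, y_pred_rare)
rec = recall_score(y_true_rare, y_pred_rare)
prec = precision_score(y_true_rare, y_pred_rare, zero_division=0)

print(acc, rec, prec)
```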

Unlike the common problem with a single target column, this project has 36 columns in y. To evaluate the predictions for each column, I use a for loop.

In [66]:
metrics_list_all=[]
for col in range(y_test.shape[1]):
    accuracy = accuracy_score(y_test.iloc[:,col], y_pred[:,col])
    precision=precision_score(y_test.iloc[:,col], y_pred[:,col])
    recall = recall_score(y_test.iloc[:,col], y_pred[:,col])
    f_1 = f1_score(y_test.iloc[:,col], y_pred[:,col])
    metrics_list=[accuracy,precision,recall,f_1]
    metrics_list_all.append(metrics_list)
metrics_df=pd.DataFrame(metrics_list_all,index=category_names,columns=["Accuracy","Precision","Recall","F_1"])
print(metrics_df)



                        Accuracy  Precision    Recall       F_1
related                 0.813244   0.830639  0.947875  0.885393
request                 0.897315   0.837446  0.507030  0.631637
offer                   0.995728   0.000000  0.000000  0.000000
aid_related             0.778761   0.741923  0.707761  0.724439
medical_help            0.925084   0.569231  0.074000  0.130973
medical_products        0.959872   0.763158  0.102473  0.180685
search_and_rescue       0.974825   0.769231  0.058140  0.108108
security                0.980012   0.500000  0.007634  0.015038
military                0.967958   0.666667  0.064516  0.117647
child_alone             1.000000   0.000000  0.000000  0.000000
water                   0.956668   0.855000  0.401408  0.546326
food                    0.943241   0.843750  0.615282  0.711628
shelter                 0.943546   0.840909  0.446552  0.583333
clothing                0.984284   0.800000  0.073394  0.134454
money                   0.975740   0.500

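If I calculate the accuracy score directly on the full label matrix, the result looks surprisingly low. For multilabel input, `accuracy_score` computes subset accuracy: a sample counts as correct only if all 36 of its labels are predicted correctly.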

In [55]:
accuracy_score(y_test.values, y_pred),pipeline.score(X_test,y_test)

(0.25587427525175466, 0.25587427525175466)

However, if `reshape` is used to flatten the data from 36 columns into a single column (one entry per individual label), the result is the same as using a for loop to calculate the accuracy score of each column and then averaging, because every column has the same number of rows.

`numpy.reshape(a, newshape, order='C')` gives a new shape to an array without changing its data.

In [56]:
accuracy_score(y_test.values.reshape(-1,1), y_pred.reshape(-1,1))

0.9496278778015123
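
The difference between the two numbers can be reproduced on a tiny label matrix with NumPy alone. The arrays below are made up for illustration.

```python
import numpy as np

y_true_small = np.array([[1, 0, 0],
                         [1, 1, 0],
                         [0, 0, 1],
                         [1, 0, 0]])
y_pred_small = np.array([[1, 0, 0],
                         [1, 0, 0],
                         [0, 0, 0],
                         [1, 0, 0]])

# Subset accuracy: a row counts only if every label in it is correct
subset_acc = np.all(y_true_small == y_pred_small, axis=1).mean()

# Flattened accuracy: every individual label counts once
flat_acc = (y_true_small.reshape(-1) == y_pred_small.reshape(-1)).mean()

# Mean of the per-column accuracies
per_column_mean = (y_true_small == y_pred_small).mean(axis=0).mean()

print(subset_acc, flat_acc, per_column_mean)
```

Two of the four rows are fully correct, so the subset accuracy is 0.5, while 10 of the 12 individual labels are correct, which gives about 0.83 for both the flattened and the per-column average.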

In [68]:
avg_metrics = metrics_df.mean()
print(("The average accuracy score among all categories is {:.4f},\n"
       "the average precision score among all categories is {:.4f},\n"
       "the average recall score among all categories is {:.4f},\n"
       "the average F1 score among all categories is {:.4f}").format(
           avg_metrics["Accuracy"], avg_metrics["Precision"], avg_metrics["Recall"], avg_metrics["F_1"]))

The average accuracy score among all categories is 0.9496,
the average precision score among all categories is 0.6161,
the average recall score among all categories is 0.2089,
the average F1 score among all categories is 0.2582


### 7. Improve your model
Use grid search to find better parameters. 

In [36]:
# Define a score used in scoring parameter
def avg_accuracy(y_test, y_pred):
    """
    This is the score_func used in make_scorer, which is then used in GridSearchCV.
    It flattens both label matrices so that every individual label counts once.
    """
    return accuracy_score(y_test.values.reshape(-1,1), y_pred.reshape(-1,1))

avg_accuracy_cv = make_scorer(avg_accuracy)
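
To show how a scorer built with `make_scorer` is called, here is a minimal sketch on toy data. The names `flat_accuracy` and `flat_accuracy_scorer` and the decision tree are assumptions for illustration. Unlike the function above, this variant converts its inputs with `np.asarray`, so it accepts DataFrames and arrays alike.

```python
import numpy as np
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.tree import DecisionTreeClassifier

def flat_accuracy(y_true, y_pred):
    """Accuracy over all individual labels of a multi-output problem."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return accuracy_score(y_true, y_pred)

flat_accuracy_scorer = make_scorer(flat_accuracy)

rng = np.random.RandomState(0)
X_toy = rng.rand(30, 3)
Y_toy = (X_toy[:, :2] > 0.5).astype(int)  # two binary targets

# Decision trees support multi-output targets natively
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, Y_toy)

# A scorer is called as scorer(estimator, X, y); it predicts internally
score = flat_accuracy_scorer(tree, X_toy, Y_toy)
print(score)
```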

In [38]:
# Take a look at what parameters are available to be tuned
list(pipeline.get_params())

['memory',
 'steps',
 'verbose',
 'features',
 'clf',
 'features__n_jobs',
 'features__transformer_list',
 'features__transformer_weights',
 'features__verbose',
 'features__text_pipeline',
 'features__text_length',
 'features__text_pipeline__memory',
 'features__text_pipeline__steps',
 'features__text_pipeline__verbose',
 'features__text_pipeline__vect',
 'features__text_pipeline__tfidf',
 'features__text_pipeline__vect__analyzer',
 'features__text_pipeline__vect__binary',
 'features__text_pipeline__vect__decode_error',
 'features__text_pipeline__vect__dtype',
 'features__text_pipeline__vect__encoding',
 'features__text_pipeline__vect__input',
 'features__text_pipeline__vect__lowercase',
 'features__text_pipeline__vect__max_df',
 'features__text_pipeline__vect__max_features',
 'features__text_pipeline__vect__min_df',
 'features__text_pipeline__vect__ngram_range',
 'features__text_pipeline__vect__preprocessor',
 'features__text_pipeline__vect__stop_words',
 'features__text_pipeline__ve
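
The double-underscore names in this list address nested objects: each `__` steps one level deeper, from a pipeline step to an estimator wrapped inside it. A minimal sketch with a smaller, hypothetical pipeline:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42))),
])

# step name -> attribute of the step -> parameter of the wrapped estimator
toy_pipeline.set_params(clf__estimator__n_estimators=5, vect__ngram_range=(1, 2))

forest = toy_pipeline.named_steps["clf"].estimator
print(forest.n_estimators)
print(toy_pipeline.get_params()["vect__ngram_range"])
```

The same names are what `GridSearchCV` expects as keys of its `param_grid`.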

In [41]:
parameters = {
    #'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)),  
    'clf__estimator__max_depth': [15, 30],  
    'clf__estimator__n_estimators': [100, 250]}

cv = GridSearchCV(
    pipeline, 
    param_grid=parameters,
    cv=3,
    scoring=avg_accuracy_cv, 
    verbose=3)

cv.fit(X_train, y_train)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] clf__estimator__max_depth=15, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__max_depth=15, clf__estimator__n_estimators=100, score=0.929, total= 1.2min
[CV] clf__estimator__max_depth=15, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.2min remaining:    0.0s


[CV]  clf__estimator__max_depth=15, clf__estimator__n_estimators=100, score=0.929, total= 1.2min
[CV] clf__estimator__max_depth=15, clf__estimator__n_estimators=100 ..


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.3min remaining:    0.0s


[CV]  clf__estimator__max_depth=15, clf__estimator__n_estimators=100, score=0.928, total= 1.2min
[CV] clf__estimator__max_depth=15, clf__estimator__n_estimators=250 ..
[CV]  clf__estimator__max_depth=15, clf__estimator__n_estimators=250, score=0.929, total= 1.6min
[CV] clf__estimator__max_depth=15, clf__estimator__n_estimators=250 ..
[CV]  clf__estimator__max_depth=15, clf__estimator__n_estimators=250, score=0.929, total= 1.7min
[CV] clf__estimator__max_depth=15, clf__estimator__n_estimators=250 ..
[CV]  clf__estimator__max_depth=15, clf__estimator__n_estimators=250, score=0.928, total= 1.7min
[CV] clf__estimator__max_depth=30, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=30, clf__estimator__n_estimators=100, score=0.933, total= 1.8min
[CV] clf__estimator__max_depth=30, clf__estimator__n_estimators=100 ..
[CV]  clf__estimator__max_depth=30, clf__estimator__n_estimators=100, score=0.933, total= 1.9min
[CV] clf__estimator__max_depth=30, clf__estimator__n_estimators

[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 23.0min finished


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('features',
                                        FeatureUnion(transformer_list=[('text_pipeline',
                                                                        Pipeline(steps=[('vect',
                                                                                         CountVectorizer(preprocessor=<function normalize_corpus at 0x000001601B0118B0>)),
                                                                                        ('tfidf',
                                                                                         TfidfTransformer())])),
                                                                       ('text_length',
                                                                        Text_Length_Extractor())])),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier(random_state=42)))]),
         

### 8. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [73]:
def evaluate_model(model, X_test, y_test,category_names):
    """
    Print the accuracy, precision, recall, and F1 score for each output category of the dataset,
    followed by their averages across all categories.

    INPUTS:
        model - a trained model for evaluation
        X_test - a pandas DataFrame or NumPy array, contains the untouched values of features.
        y_test - a pandas DataFrame, contains the true category values of the messages.
        category_names - the category (column) names, used as the index of the metrics table.

    OUTPUT:
        None. The per-category metrics table (metrics_df) and the averages are printed.
    """
    y_pred=model.predict(X_test)
    metrics_list_all=[]
    for col in range(y_test.shape[1]):
        accuracy = accuracy_score(y_test.iloc[:,col], y_pred[:,col])
        precision=precision_score(y_test.iloc[:,col], y_pred[:,col])
        recall = recall_score(y_test.iloc[:,col], y_pred[:,col])
        f_1 = f1_score(y_test.iloc[:,col], y_pred[:,col])
        metrics_list=[accuracy,precision,recall,f_1]
        metrics_list_all.append(metrics_list)
    metrics_df=pd.DataFrame(metrics_list_all,index=category_names,columns=["Accuracy","Precision","Recall","F_1"])
    print(metrics_df)
    print("----------------------------------------------------------------------")
    print(("The average accuracy score among all categories is {:.4f},\nthe average precision score among all categories is {:.4f},\nthe average recall score among all categories is {:.4f},\nthe average F1 score among all categories is {:.4f}").format(metrics_df.mean()["Accuracy"],metrics_df.mean()["Precision"],metrics_df.mean()["Recall"],metrics_df.mean()["F_1"]))
    
    return None

In [74]:
# Get the best model and store it as best_randomforest
best_randomforest=cv.best_estimator_
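
With the default `refit=True`, `best_estimator_` is the estimator refitted on all the data passed to `fit`, using the best parameter combination found. A minimal sketch on synthetic data (the dataset, the decision tree, and the grid are assumptions chosen to run quickly):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 3]},
    cv=3,
)
search.fit(X_toy, y_toy)

# The refitted best model can be used directly for prediction
best_tree = search.best_estimator_
print(search.best_params_)
print(best_tree.predict(X_toy[:3]))
```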

In [75]:
evaluate_model(best_randomforest, X_test, y_test,category_names)



                        Accuracy  Precision    Recall       F_1
related                 0.761520   0.761570  0.999599  0.864499
request                 0.839335   0.967033  0.077329  0.143206
offer                   0.995728   0.000000  0.000000  0.000000
aid_related             0.727037   0.857030  0.402896  0.548118
medical_help            0.923711   0.500000  0.002000  0.003984
medical_products        0.956973   1.000000  0.003534  0.007042
search_and_rescue       0.973756   0.000000  0.000000  0.000000
security                0.980012   0.000000  0.000000  0.000000
military                0.966890   0.000000  0.000000  0.000000
child_alone             1.000000   0.000000  0.000000  0.000000
water                   0.935154   1.000000  0.002347  0.004684
food                    0.887244   1.000000  0.009383  0.018592
shelter                 0.912420   1.000000  0.010345  0.020478
clothing                0.983522   1.000000  0.009174  0.018182
money                   0.975740   0.000

### 9. Export your model as a pickle file

**Pickle** is the standard way of serializing objects in Python. You can use it to serialize a trained machine learning model and save the serialized form to a file.

Later you can load this file to deserialize your model and use it to make new predictions.

In [62]:
filename = 'best_randomforest.pkl'
# Save the tuned model returned by the grid search
with open(filename, 'wb') as f:
    pickle.dump(best_randomforest, f)
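
The save-and-load round trip can be sketched with a plain dictionary standing in for the model. The object and the file name are hypothetical; a fitted pipeline is handled the same way, provided its custom classes and functions are importable when loading.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model
toy_model = {"name": "toy", "weights": [0.1, 0.2, 0.3]}

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")

# Serialize to disk; the with-block closes the file afterwards
with open(path, "wb") as f:
    pickle.dump(toy_model, f)

# Deserialize and check that the object survived the round trip
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.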

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.