# ML pipeline

## Procedures:

1. [**Loading from database**](#1)
2. [**Normalization, tokenization, stop-word removal and lemmatization**](#2)
3. [**Find the best-performing classifier**](#3)
4. [**Tuning hyperparameters with GridSearchCV**](#4)
5. [**Statistical report**](#5)
6. [**Saving model as pickle file**](#6)
7. [**Testing**](#7)

<a id="1"></a>
## Load data

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

nltk.download(['punkt', 'stopwords', 'wordnet'])

import pandas as pd
import re
from sqlalchemy import create_engine

# load from database table
engine = create_engine('sqlite:///../data/Disaster_database.db')
df = pd.read_sql_table('overall', con = engine)
df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\EllenChen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\EllenChen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\EllenChen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,id,message,original,genre,related,request,offer,aid related,medical help,medical products,...,aid centers,other infrastructure,weather related,floods,storm,fire,earthquake,cold,other weather,direct report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [2]:
# count the unique values in each column
df.nunique()

id                        26180
message                   26177
original                   9630
genre                         3
related                       3
request                       2
offer                         2
aid related                   2
medical help                  2
medical products              2
search and rescue             2
security                      2
military                      2
child alone                   1
water                         2
food                          2
shelter                       2
clothing                      2
money                         2
missing people                2
refugees                      2
death                         2
other aid                     2
infrastructure related        2
transport                     2
buildings                     2
electricity                   2
tools                         2
hospitals                     2
shops                         2
aid centers                   2
other in

In [11]:
# count how often each value occurs in `related`
df.related.value_counts()

1    20094
0     6122
Name: related, dtype: int64

In [3]:
# `related` has a third value '2'; map it to '1' so the label is binary
df.related = df.related.map(lambda x: '1' if x == '2' else x)
df.describe(include='O')

Unnamed: 0,message,original,genre,related,request,offer,aid related,medical help,medical products,search and rescue,...,aid centers,other infrastructure,weather related,floods,storm,fire,earthquake,cold,other weather,direct report
count,26216,10170,26216,26216,26216,26216,26216,26216,26216,26216,...,26216,26216,26216,26216,26216,26216,26216,26216,26216,26216
unique,26177,9630,3,2,2,2,2,2,2,2,...,2,2,2,2,2,2,2,2,2,2
top,#NAME?,Nap fe ou konnen ke apati de jodi a sevis SMS ...,news,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
freq,4,20,13054,20094,21742,26098,15356,24132,24903,25492,...,25907,25065,18919,24061,23773,25934,23761,25686,24840,21141
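
An equivalent way to collapse the stray value is `Series.replace`. The sketch below runs on a small made-up Series of string labels (the notebook's label columns appear to be stored as strings, judging by the `describe(include='O')` output); it is an illustration and not the notebook's data.

```python
import pandas as pd

# toy stand-in for df.related; values are strings, as in the notebook
related = pd.Series(['1', '0', '2', '1', '2'])

# map the stray '2' onto '1' so the label becomes binary
related_binary = related.replace('2', '1')

print(related_binary.value_counts().to_dict())
```

Checking `nunique()` again after the replacement is a quick way to confirm that only two values remain.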


<a id="2"></a>
## Tokenization

### Procedures:
1. Detect any URL-like text and replace it with `urlplaceholder`
2. Replace punctuation with a space
3. Tokenize the lower-cased text
4. Remove stop words
5. Lemmatize the remaining tokens

In [5]:
# replace URLs with a placeholder, then tokenize and lemmatize the text

url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def tokenize(text):
    # find URL-like text and replace it with a placeholder
    detect_url = re.findall(url_regex, text)
    for url in detect_url:
        text = text.replace(url, 'urlplaceholder')

    # normalize (lower case, punctuation to spaces) and tokenize
    texts = word_tokenize(re.sub('[^a-zA-Z0-9]', ' ', text.lower()))
    # remove stop words; build the set once instead of once per token
    stop_words = set(stopwords.words('english'))
    words = [w for w in texts if w not in stop_words]
    # lemmatize the remaining tokens with a single lemmatizer instance
    lemmatizer = WordNetLemmatizer()
    clean_tokens = [lemmatizer.lemmatize(word) for word in words]

    return clean_tokens
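
The URL replacement and normalization steps can be tried on their own, without the NLTK downloads. The following is a minimal sketch and not the notebook's `tokenize`: it uses `re.sub` in place of `findall` plus `replace`, splits on whitespace instead of calling `word_tokenize`, and skips stop-word removal and lemmatization. The helper name `normalize` and the sample message are made up for illustration.

```python
import re

URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def normalize(text):
    """Hypothetical helper: URL placeholder, lower case, punctuation to spaces."""
    # swap every URL-like match for a fixed placeholder token
    text = re.sub(URL_REGEX, 'urlplaceholder', text)
    # lower-case and turn anything that is not a letter or digit into a space
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    # whitespace split stands in for word_tokenize here
    return text.split()

print(normalize('Shelter info at http://example.com/help please share!'))
```

Replacing URLs before stripping punctuation matters: otherwise a single address would be broken into several meaningless tokens such as `http` and `com`.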

### Define independent and target variables

In [4]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# define and split independent and target variable
X = df.message.values
Y = df[df.columns[4:]].values

X_train, X_test, y_train, y_test = train_test_split(X, Y)
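
`train_test_split` shuffles with a fresh random seed on each call unless `random_state` is set, so re-running the cell above produces a different test set. A minimal sketch on made-up arrays, assuming a fixed seed of 42 (the notebook itself does not set one):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-ins for the messages and the label matrix
X_toy = np.array(['msg %d' % i for i in range(8)])
Y_toy = np.arange(16).reshape(8, 2)

# the same random_state yields the same split on every call
split_a = train_test_split(X_toy, Y_toy, random_state=42)
split_b = train_test_split(X_toy, Y_toy, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same, len(split_a[0]), len(split_a[1]))
```

A fixed seed keeps the test set stable across kernel restarts, which matters when a saved model is evaluated in a later session.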

<a id="3"></a>
## Trying to find the best performance classifier

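A **Pipeline** chains the vectorizer, the TF-IDF transformer and the classifier so that each step runs automatically. Because every message can carry several labels, the classifier is wrapped in **MultiOutputClassifier**, which fits one copy of the estimator per label. The candidate classifiers are collected in a list called `models`, and a **for loop** fits and scores each one to find the best performer.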

In [9]:
# loop over the candidate classifiers and compare their accuracy
# MultiOutputClassifier handles the multi-label (multi-output) target

RC = RandomForestClassifier(random_state=42)
DTC = DecisionTreeClassifier(random_state=42)
KNC = KNeighborsClassifier()

models = [RC, DTC, KNC]

for model in models:
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(model))])
    
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = (y_test == y_pred).mean()
    print(model)
    print(accuracy)

RandomForestClassifier(random_state=42)
0.9484708235852575
DecisionTreeClassifier(random_state=42)
0.9322381582070322
KNeighborsClassifier()
0.928249923710711
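
Note that `(y_test == y_pred).mean()` averages over every cell of the label matrix, so it is an element-wise accuracy across all labels rather than the share of messages whose labels are all predicted correctly. The sketch below contrasts the two on a made-up label matrix with 4 rows and 3 labels; the variable names are illustrative only.

```python
import numpy as np

# toy ground truth and predictions: 4 messages, 3 labels
y_true = np.array([[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]])
y_hat = np.array([[1, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 0]])

# share of individual label cells predicted correctly
elementwise_acc = (y_true == y_hat).mean()
# share of rows where every label is correct
exact_match_acc = (y_true == y_hat).all(axis=1).mean()

print(elementwise_acc, exact_match_acc)
```

With many mostly-zero labels, element-wise accuracy stays high even when rare positives are missed, so it is worth reading together with per-label precision and recall.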


In [16]:
model = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=42,
                                                             n_estimators=200)))])

# inspect all hyperparameters of the pipeline
model.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000001A0B62C0430>)),
  ('tfidf', TfidfTransformer()),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(n_estimators=200,
                                                          random_state=42)))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x000001A0B62C0430>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier(n_estimators=200,
                                                        random_state=42)),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(
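
The double-underscore names in this output address parameters of nested steps: `clf__estimator__n_estimators` reaches the `n_estimators` of the estimator wrapped by the `clf` step. A small sketch with a reduced pipeline (no custom tokenizer, and a value of 50 chosen only for illustration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=42))),
])

# <step>__<attribute>__<parameter> walks down the nesting
pipe.set_params(clf__estimator__n_estimators=50)

print(pipe.get_params()['clf__estimator__n_estimators'])
```

The same names are the keys that a parameter grid for the pipeline has to use.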

<a id="4"></a>
## Tuning hyper parameters

**GridSearchCV** searches a user-defined grid of parameters, given as a dict. After fitting on the training data, the **best_params_** attribute holds the best combination found.

In [13]:
from sklearn.model_selection import GridSearchCV

# custom hyperparameter grid in dict format
parameters = {'vect__max_features': [None, 5, 10],
              'clf__estimator__n_estimators': [100, 200],
              'clf__estimator__max_samples': [None, 10, 20]}

cv = GridSearchCV(model, parameters)
cv.fit(X_train, y_train)

# find the best-performing hyperparameters
best_params = cv.best_params_
print(best_params)

{'clf__estimator__max_samples': None, 'clf__estimator__n_estimators': 200, 'vect__max_features': None}
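
The cost of a grid search grows with the product of the option counts. The sketch below enumerates the grid above with `itertools.product`; the fold count of 5 is an assumption based on the `GridSearchCV` default in recent scikit-learn releases when `cv` is not passed.

```python
from itertools import product

parameters = {'vect__max_features': [None, 5, 10],
              'clf__estimator__n_estimators': [100, 200],
              'clf__estimator__max_samples': [None, 10, 20]}

# every combination of one value per parameter
combos = [dict(zip(parameters, values))
          for values in product(*parameters.values())]

n_folds = 5  # assumed default number of cross-validation folds
print(len(combos), 'combinations,', len(combos) * n_folds, 'fits')
```

Each fit here trains a full multi-output random forest, so trimming the grid is the quickest way to shorten the search.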


<a id="5"></a>
## Statistical report

- A **confusion matrix** shows the counts of correct and incorrect predictions for each label.
- A **classification report** gives the precision, recall and F1 score for each label.

In [22]:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# split y_test and y_pred into one array per label (column)
n_labels = y_test.shape[1]
true = [y_test[:, i] for i in range(n_labels)]
pred = [y_pred[:, i] for i in range(n_labels)]

col_name = df.columns[4:].values

# output each label's confusion matrix and classification report 
for t, p, col in zip(true, pred, col_name):
    labels = np.unique(t)
    confusion_mat = confusion_matrix(t, p, labels= labels)
    label_accuracy = (t == p).mean()

    print('Labels Name:', col.upper())
    print('Labels:', labels)
    print('Confusion matrix of each label:\n', confusion_mat, '\n')
    print('{} accuracy:'.format(col.upper()), label_accuracy, '\n')
    print('Classification Report: \n', classification_report(t, p), '\n')

Labels Name: RELATED
Labels: ['0' '1']
Confusion matrix of each label:
 [[ 629  921]
 [ 261 4743]] 

RELATED accuracy: 0.8196521208422337 

Classification Report: 
               precision    recall  f1-score   support

           0       0.71      0.41      0.52      1550
           1       0.84      0.95      0.89      5004

    accuracy                           0.82      6554
   macro avg       0.77      0.68      0.70      6554
weighted avg       0.81      0.82      0.80      6554
 

Labels Name: REQUEST
Labels: ['0' '1']
Confusion matrix of each label:
 [[5342  116]
 [ 540  556]] 

REQUEST accuracy: 0.8999084528532194 

Classification Report: 
               precision    recall  f1-score   support

           0       0.91      0.98      0.94      5458
           1       0.83      0.51      0.63      1096

    accuracy                           0.90      6554
   macro avg       0.87      0.74      0.79      6554
weighted avg       0.89      0.90      0.89      6554
 

Labels Name:

  _warn_prf(average, modifier, msg_start, len(result))


Classification Report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      6523
           1       0.00      0.00      0.00        31

    accuracy                           1.00      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg       0.99      1.00      0.99      6554
 

Labels Name: AID RELATED
Labels: ['0' '1']
Confusion matrix of each label:
 [[3223  616]
 [ 798 1917]] 

AID RELATED accuracy: 0.7842538907537382 

Classification Report: 
               precision    recall  f1-score   support

           0       0.80      0.84      0.82      3839
           1       0.76      0.71      0.73      2715

    accuracy                           0.78      6554
   macro avg       0.78      0.77      0.78      6554
weighted avg       0.78      0.78      0.78      6554
 

Labels Name: MEDICAL HELP
Labels: ['0' '1']
Confusion matrix of each label:
 [[6004    8]
 [ 506   36]] 

MEDICAL HELP accuracy: 0.9215746109246

Labels Name: INFRASTRUCTURE RELATED
Labels: ['0' '1']
Confusion matrix of each label:
 [[6119    4]
 [ 430    1]] 

INFRASTRUCTURE RELATED accuracy: 0.9337808971620385 

Classification Report: 
               precision    recall  f1-score   support

           0       0.93      1.00      0.97      6123
           1       0.20      0.00      0.00       431

    accuracy                           0.93      6554
   macro avg       0.57      0.50      0.49      6554
weighted avg       0.89      0.93      0.90      6554
 

Labels Name: TRANSPORT
Labels: ['0' '1']
Confusion matrix of each label:
 [[6227   12]
 [ 287   28]] 

TRANSPORT accuracy: 0.9543790051876716 

Classification Report: 
               precision    recall  f1-score   support

           0       0.96      1.00      0.98      6239
           1       0.70      0.09      0.16       315

    accuracy                           0.95      6554
   macro avg       0.83      0.54      0.57      6554
weighted avg       0.94      0.95  
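
The figures in each classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes the class `1` scores for REQUEST from the matrix printed above, assuming scikit-learn's layout of rows as true labels and columns as predicted labels.

```python
# REQUEST confusion matrix from the output above
tn, fp = 5342, 116   # true label 0: predicted 0, predicted 1
fn, tp = 540, 556    # true label 1: predicted 0, predicted 1

precision = tp / (tp + fp)   # of all predicted 1s, how many were right
recall = tp / (tp + fn)      # of all true 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 4))
```

The values round to 0.83, 0.51 and 0.63, in line with the report, and the accuracy agrees with the printed 0.8999. The gap between 0.90 accuracy and 0.51 recall shows how much the majority class dominates the accuracy figure.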

<a id="6"></a>
## Saving model

Store the trained and tuned model as a **pickle file**.

In [24]:
import pickle

# save model; the context manager closes the file handle
with open('disaster_model.pkl', 'wb') as f:
    pickle.dump(cv, f)

# load model
with open('disaster_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# check accuracy
y_pred = loaded_model.predict(X_test)
accuracy = (y_test == y_pred).mean()
print(accuracy)

0.9488183636795172
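
A note on the pickle round trip: `pickle` stores functions by reference rather than by value, so the custom `tokenize` function must be defined or importable in any session that loads `disaster_model.pkl`. The sketch below shows the save and load pattern with context managers; a plain dict stands in for the fitted model, and the temporary path is illustrative.

```python
import os
import pickle
import tempfile

# stand-in object; the notebook pickles the fitted GridSearchCV instead
model_stub = {'name': 'disaster_model', 'n_estimators': 200}

path = os.path.join(tempfile.mkdtemp(), 'model_stub.pkl')

# write and read inside `with` blocks so the files are always closed
with open(path, 'wb') as f:
    pickle.dump(model_stub, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model_stub)
```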



<a id='7'></a>
## Test pipeline output

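Load the saved **pickle file** into a new model variable and test its performance. The accuracy printed here is higher than the 0.9488 measured right after training. A likely cause is that `train_test_split` was re-run without a fixed `random_state`, so this test set overlaps the data the model was trained on, and the figure should be read as optimistic.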

In [8]:
# load model
with open('disaster_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

# check accuracy
y_pred = loaded_model.predict(X_test)
accuracy = (y_test == y_pred).mean()
print(accuracy)

0.9864586512053708
