# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables `X` and `y`

In [27]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import re
import sqlite3

import sqlalchemy
from sqlalchemy import create_engine

from sklearn import multioutput
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import SGDClassifier, RidgeClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, fbeta_score, make_scorer, confusion_matrix

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download(['punkt', 'wordnet', 'stopwords'])


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse_YG.db')
df = pd.read_sql_table("disaster_response_messages", con=engine)
X = df['message']
y = df.iloc[:, 4:]
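To make the loading step concrete, here is a minimal sketch that writes a toy table to an in-memory SQLite database and reads it back with `read_sql_table`. The toy column layout (`id`, `message`, `original`, `genre`, then the category columns) is an assumption made for illustration; it is only meant to show why `iloc[:, 4:]` would select the category columns.

```python
import pandas as pd
from sqlalchemy import create_engine

# Toy stand-in for the real table. The column layout is an assumption:
# four descriptive columns followed by one column per category.
toy = pd.DataFrame({
    "id": [1, 2],
    "message": ["we need water", "roads are blocked"],
    "original": ["", ""],
    "genre": ["direct", "news"],
    "water": [1, 0],
    "transport": [0, 1],
})

engine_demo = create_engine("sqlite://")  # in-memory SQLite database
toy.to_sql("disaster_response_messages", engine_demo, index=False)

df_toy = pd.read_sql_table("disaster_response_messages", con=engine_demo)
X_toy = df_toy["message"]      # feature: the raw message text
y_toy = df_toy.iloc[:, 4:]     # targets: every column from the fifth onward
print(list(y_toy.columns))
```

Checking `y.columns` like this on the real table is a quick way to confirm that the slice picks up only category columns.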

### 2. Write a tokenization function to process your text data

In [4]:
def tokenize(text):
    # normalize text
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # extract the word tokens
    tokens = nltk.word_tokenize(text)
    # lemmatizer
    lemmatizer = nltk.WordNetLemmatizer()
    # List of clean tokens
    token_lemmed = [lemmatizer.lemmatize(w).lower().strip() for w in tokens]
    return token_lemmed
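As a small illustration of the normalization step in `tokenize`, the sketch below applies the same regular expression to a sample string. It uses `str.split` as a rough stand-in for `nltk.word_tokenize` and skips lemmatization, so it runs without any NLTK data; `normalize` and `sample` are names made up for this example.

```python
import re

def normalize(text):
    # Same substitution as in tokenize: lowercase the text, then replace
    # every character that is not a letter or digit with a space.
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

sample = "Water needed!! Call 555-1234."
normalized = normalize(sample)

# Whitespace split as a simplified stand-in for nltk.word_tokenize.
rough_tokens = normalized.split()
print(rough_tokens)  # ['water', 'needed', 'call', '555', '1234']
```

Because punctuation is replaced by spaces before tokenizing, hyphenated items such as phone numbers are split into separate tokens.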

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', multioutput.MultiOutputClassifier(RandomForestClassifier()))
        ])
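The first two pipeline steps (`CountVectorizer` followed by `TfidfTransformer`) should produce the same features as a single `TfidfVectorizer` with default settings. The sketch below checks this on a made-up corpus; it uses the default tokenizer rather than the `tokenize` function above so that it stays self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
from sklearn.pipeline import Pipeline

# Made-up corpus, for illustration only.
corpus = [
    "we need water and food",
    "the road is blocked",
    "water supply is cut off",
]

# Two-step variant, as in the pipeline above (default tokenizer here).
two_step = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
features_two_step = two_step.fit_transform(corpus)

# One-step variant.
features_one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(features_two_step.toarray(), features_one_step.toarray())
print(same)
```

Keeping the two steps separate makes it possible to tune the counting and the weighting independently in a grid search; merging them is mostly a matter of brevity.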

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
#train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
pipeline_fitted = pipeline.fit(X_train, y_train)
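Since no `test_size` is passed to `train_test_split`, scikit-learn falls back to its default of holding out 25% of the rows. A minimal sketch on made-up data (`X_demo` and `y_demo` are placeholders) shows the resulting split sizes and that a fixed `random_state` makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up samples, for illustration only.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))  # 75 25

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=42)
print(np.array_equal(X_te, X_te2))  # True
```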

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [16]:
# report the mean accuracy per category instead: the full classification_report output is long and hard to scan
y_pred = pipeline.predict(X_test)
accuracy = (y_pred == y_test).mean()
accuracy

related                   0.794477
request                   0.878090
offer                     0.996033
aid_related               0.725664
medical_help              0.923711
medical_products          0.952090
search_and_rescue         0.971926
security                  0.981691
military                  0.969332
child_alone               1.000000
water                     0.955447
food                      0.932103
shelter                   0.924168
clothing                  0.985658
money                     0.977876
missing_people            0.988862
refugees                  0.969484
death                     0.958804
other_aid                 0.870613
infrastructure_related    0.931950
transport                 0.959414
buildings                 0.951633
electricity               0.977876
tools                     0.994507
hospitals                 0.989167
shops                     0.996033
aid_centers               0.987489
other_infrastructure      0.953616
weather_related     
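The per-category report that this step asks for can be sketched as follows. The labels and predictions here are made up; with the real data, `y_test` and `y_pred` would take the place of `y_true` and `y_hat`. The `zero_division` argument assumes a reasonably recent scikit-learn release.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Made-up true labels and predictions for two categories.
y_true = pd.DataFrame({
    "water":       [1, 0, 0, 1, 0, 0],
    "child_alone": [0, 0, 0, 0, 0, 0],
})
y_hat = np.array([
    [1, 0],
    [0, 0],
    [0, 0],
    [0, 0],
    [1, 0],
    [0, 0],
])

# One report per output column: precision, recall and f1 for each class.
reports = {}
for i, column in enumerate(y_true.columns):
    reports[column] = classification_report(
        y_true[column], y_hat[:, i], zero_division=0)
    print(column)
    print(reports[column])
```

In this toy example `water` has an accuracy of about 0.67 but a recall of only 0.5 for the positive class, which accuracy alone would hide. A column that contains only zeros, like the toy `child_alone` here, trivially reaches an accuracy of 1.0.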

### 6. Improve your model
Use grid search to find better parameters. 

In [11]:
# try a parameter search, but it is too slow...
parameters = {
                'clf__estimator__n_estimators': [10, 20],
                'clf__estimator__min_samples_split': [2, 5]
              }
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__n_estimators': [10, 20], 'clf__estimator__min_samples_split': [2, 5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
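The `clf__estimator__` prefix in the grid keys follows the nesting of the pipeline: step name, then the `estimator` attribute of `MultiOutputClassifier`, then the parameter of the wrapped model. A small sketch, on made-up texts and labels, of how the valid keys can be checked with `get_params()` before a search is started:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Only keys listed by get_params() are accepted in param_grid.
valid = demo_pipeline.get_params().keys()
print("clf__estimator__n_estimators" in valid)   # True
print("clf__estimator__learning_rate" in valid)  # False: not a random forest parameter

# Made-up texts and two label columns (water, food), for illustration only.
texts = [
    "need water", "need food", "water shortage here", "food and water",
    "send food now", "no water left", "food supplies low", "water and food needed",
]
y_demo = np.array([
    [1, 0], [0, 1], [1, 0], [1, 1],
    [0, 1], [1, 0], [0, 1], [1, 1],
])

grid = {"clf__estimator__n_estimators": [5, 10]}
search = GridSearchCV(demo_pipeline, param_grid=grid, cv=2)
search.fit(texts, y_demo)
print(search.best_params_)
```

On the full dataset, passing `n_jobs=-1` to `GridSearchCV` or lowering the number of folds are common ways to shorten the search, at the cost of memory or of a noisier estimate.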

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [15]:
#as above
y_pred_cv = cv.predict(X_test)
accuracy = (y_pred_cv == y_test).mean()
accuracy

related                   0.803784
request                   0.888007
offer                     0.996033
aid_related               0.745194
medical_help              0.923558
medical_products          0.951633
search_and_rescue         0.971773
security                  0.981385
military                  0.970095
child_alone               1.000000
water                     0.947360
food                      0.919439
shelter                   0.927983
clothing                  0.985352
money                     0.977724
missing_people            0.988862
refugees                  0.968874
death                     0.958041
other_aid                 0.872139
infrastructure_related    0.932713
transport                 0.960330
buildings                 0.952548
electricity               0.977724
tools                     0.994507
hospitals                 0.989167
shops                     0.996033
aid_centers               0.987489
other_infrastructure      0.953921
weather_related     

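Slightly better for some categories (e.g. `related`, `request`, `aid_related`) but worse for others (e.g. `water`, `food`). Given how long the grid search takes, the gain is not worth the cost!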

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [29]:
# try RidgeClassifier as the classifier
new_pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', multioutput.MultiOutputClassifier (RidgeClassifier()))
        ])
new_pipeline.fit(X_train, y_train)


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...=None, normalize=False, random_state=None, solver='auto',
        tol=0.001),
           n_jobs=1))])

In [34]:
y_pred_ridgeCV = new_pipeline.predict(X_test)
accuracy = (y_pred_ridgeCV == y_test).mean()
accuracy

related                   0.816601
request                   0.903113
offer                     0.996033
aid_related               0.774794
medical_help              0.928288
medical_products          0.956515
search_and_rescue         0.973146
security                  0.981691
military                  0.972383
child_alone               1.000000
water                     0.961245
food                      0.947208
shelter                   0.946445
clothing                  0.988862
money                     0.978334
missing_people            0.989472
refugees                  0.970400
death                     0.968874
other_aid                 0.874428
infrastructure_related    0.933628
transport                 0.962771
buildings                 0.962618
electricity               0.980165
tools                     0.994507
hospitals                 0.989014
shops                     0.996033
aid_centers               0.987489
other_infrastructure      0.953769
weather_related     

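Yeah, better! `RidgeClassifier` matches or beats both random forest runs on nearly every category shown (e.g. `aid_related`: 0.775 vs. 0.745 for the tuned forest).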

### 9. Export your model as a pickle file

In [33]:
# use context managers so that the files are closed after writing
with open('model_cv.pkl', 'wb') as f:
    pickle.dump(cv, f)
with open('model_ridgeCV.pkl', 'wb') as f:
    pickle.dump(new_pipeline, f)
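To check that an exported model can actually be loaded again, a round trip like the following sketch can help. It uses a tiny `RidgeClassifier` fitted on made-up numbers and a temporary directory, so the file name `model_demo.pkl` is only a placeholder.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import RidgeClassifier

# Tiny made-up training set, for illustration only.
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([0, 0, 1, 1])
model = RidgeClassifier().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_demo.pkl")
    # Write the fitted model to disk...
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # ...and read it back.
    with open(path, "rb") as f:
        restored = pickle.load(f)

same_predictions = np.array_equal(
    model.predict(X_small), restored.predict(X_small))
print(same_predictions)
```

Note that pickle stores functions by reference rather than by value, so a pipeline built with `tokenizer=tokenize` can only be unpickled in a process where the same `tokenize` function is defined or importable.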

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
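The template itself is not shown here, so the following is only a hypothetical sketch of how such a script might read its inputs from the command line. The argument names `database_filepath` and `model_filepath` and the functions `parse_args` and `main` are assumptions made for illustration, not the template's actual interface.

```python
import argparse

def parse_args(argv=None):
    # Hypothetical command-line interface: one input path and one output path.
    parser = argparse.ArgumentParser(
        description="Train a message classifier and export it as a pickle file.")
    parser.add_argument("database_filepath",
                        help="SQLite database containing the cleaned messages")
    parser.add_argument("model_filepath",
                        help="where to write the pickled model")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    # Placeholder for the notebook steps; each entry would become a function call.
    steps = [
        f"load data from {args.database_filepath}",
        "build pipeline",
        "train pipeline",
        "evaluate on the test split",
        f"save model to {args.model_filepath}",
    ]
    for step in steps:
        print(step)
    return steps

if __name__ == "__main__":
    main()
```

Under these assumptions the script would be run as `python train.py DisasterResponse_YG.db model.pkl`.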