# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import nltk
import re
import string
nltk.download(['punkt', 'wordnet', 'stopwords'])

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.offline as py
py.init_notebook_mode(connected=True)
from IPython.display import display, HTML

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('messages_label', engine)

In [3]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
#determine X and Y dataset
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)
columns_names = Y.columns.tolist()

### 2. Write a tokenization function to process your text data

In [6]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 
# Init the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

In [7]:
#check punctuation
remove_punc = str.maketrans('', '', string.punctuation)
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [8]:
#check stopwords 
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [9]:
stop_words = nltk.corpus.stopwords.words("english")
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
remove_punc_table = str.maketrans('', '', string.punctuation)

In [10]:
def tokenize(text):
    
    """
    tokenize text
    
    Parameters:
    text: Give text which is tokenized
    
    Returns:
    word: text split to word
    
    """
    #remove punctuation
    text = text.translate(remove_punc_table).lower()
    
    # tokenize text
    tokens = nltk.word_tokenize(text)
    
    # lemmatize and remove stop words
    return [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

Source: https://machinelearningmastery.com/clean-text-machine-learning-python/

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [11]:
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
pipeline = Pipeline([
                    ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                    ('clf', MultiOutputClassifier(RandomForestClassifier()))
                    ])

I checked pipeline ;
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
#split data set to two subset ; train and test
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=42)

In [15]:
X_train.head()

19699    He blames deforestation - whether it is from i...
5728     I would like to know what to do to get a job, ...
14041    In addition to food shortages, health problems...
3576                    I'm a survivor and a student in.. 
9578     Some body tell me plateau central won't hit by...
Name: message, dtype: object

In [16]:
#create model
model=pipeline.fit(X_train, y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [17]:
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

In [18]:
#find model's score
pipeline.score(X_test, y_test)

0.15207373271889402

In [19]:
#find estimation results for test and train set
y_test_est = pipeline.predict(X_test)
y_train_est = pipeline.predict(X_train)

I find f1_score, recall, precision and accuracy for test set.

In [20]:
#According to test result calculate f1, recall and precision
f1_test=f1_score(y_test, y_test_est, average=None)
f1_df= pd.DataFrame(f1_test,index= Y.columns, columns=["f1_score"]).sort_values(by="f1_score", 
                                                                                ascending=False)

recall_test= recall_score(y_test, y_test_est, average=None)
recall_df= pd.DataFrame(recall_test,index= Y.columns, columns=["recall_score"]).sort_values(by="recall_score",
                                                                                              ascending=False)

precision_test= precision_score(y_test, y_test_est, average=None)
precision_df= pd.DataFrame(precision_test,index= Y.columns, columns=["precision_score"]).sort_values(by="precision_score",
                                                                                              ascending=False)



F-score is ill-defined and being set to 0.0 in labels with no predicted samples.


F-score is ill-defined and being set to 0.0 in labels with no true samples.


Recall is ill-defined and being set to 0.0 in labels with no true samples.


Precision is ill-defined and being set to 0.0 in labels with no predicted samples.



In [21]:
#According to train set calculate f1, recall and precision
f1_train=f1_score(y_train, y_train_est, average=None)
f1_df_train= pd.DataFrame(f1_train,index= Y.columns, columns=["f1_score"]).sort_values(by="f1_score", 
                                                                                ascending=False)

recall_train= recall_score(y_train, y_train_est, average=None)
recall_df_train= pd.DataFrame(recall_train,index= Y.columns, columns=["recall_score"]).sort_values(by="recall_score",
                                                                                              ascending=False)

precision_train= precision_score(y_train_est, y_train_est, average=None)
precision_df_train= pd.DataFrame(precision_train ,index= Y.columns, columns=["precision_score"]).sort_values(by="precision_score",
                                                                                              ascending=False)


F-score is ill-defined and being set to 0.0 in labels with no predicted samples.


F-score is ill-defined and being set to 0.0 in labels with no true samples.


Recall is ill-defined and being set to 0.0 in labels with no true samples.


Precision is ill-defined and being set to 0.0 in labels with no predicted samples.



In [26]:
#concat f1 score, recall and precision dataset
result_test = pd.concat([f1_df, recall_df, precision_df], axis=1, sort=False)

In [27]:
#test result for each columns
result_test

Unnamed: 0,f1_score,recall_score,precision_score
related,0.835342,0.913608,0.769428
aid_related,0.294531,0.220015,0.445372
weather_related,0.271813,0.186947,0.497791
earthquake,0.260979,0.168831,0.574586
request,0.149451,0.089631,0.449339
direct_report,0.126246,0.073988,0.429864
storm,0.091988,0.051839,0.407895
security,0.02963,0.016,0.2
other_aid,0.027027,0.01524,0.119266
shelter,0.018868,0.010256,0.117647


In [28]:
#graph for test/train results

def graph(x,y, x_title, y_title):
    
    """
    grahp for evaluation metrics 
    
    Parameters:
    x= x axis values
    y= y axis values
    x_title= x axis title
    y_title= y axis title
    
    Returns:
    graph
    

    """

    data=[]
    trace = go.Scatter(
     x = x,
     y = y,
    mode = 'markers+lines',
    marker = dict(
        size = 10,
        line = dict(
            width = 1,
            color = 'rgb(0, 0, 0)'
            
        )))

    plot1 = [trace]
    data=[trace]

    layout = {
      'xaxis': {'title': x_title},
      'yaxis': {'title': y_title},
      }
    

    return py.iplot({'data': data, "layout":layout})

In [29]:
#graph for f1 score 
x_title="class"
y_title="f1_score"
x=result_test.index
y= result_test["f1_score"]

graph(x,y, x_title, y_title)

In [30]:
#graph for recall_score
x_title="class"
y_title="recall"
x=result_test.index
y= result_test["recall_score"]

graph(x,y, x_title, y_title)

In [31]:
#graph for precision score
x_title="class"
y_title="precision_score"
x=result_test.index
y= result_test["precision_score"]

graph(x,y, x_title, y_title)


In [32]:
#concat f1 score, recall and precision dataset (train_result)
result_train = pd.concat([f1_df_train, recall_df_train, precision_df_train], axis=1, sort=False)

In [33]:
result_train

Unnamed: 0,f1_score,recall_score,precision_score
related,0.990346,0.997071,1.0
aid_related,0.963368,0.932988,1.0
weather_related,0.943688,0.897648,1.0
request,0.920299,0.857013,1.0
direct_report,0.91888,0.853333,1.0
earthquake,0.897398,0.821918,1.0
storm,0.885455,0.796185,1.0
food,0.883661,0.792661,1.0
other_aid,0.880208,0.788491,1.0
tools,0.869565,0.769231,1.0


In [34]:
def evaluation(model, y_test, X_test, column_names):
    
    """
    Calculate model's score; accuracy, precision, recall and f1 score
    
    Parameters
    model: determine model and model's parameters
    y_test: test set/categories
    X_test: test_set/messages-text
    column_names: categories names
    
    Returns
    df_score: dataframe/evaluation metrics
 
    """
    #create empty list
    score = []
    #find prediction results based on X_test set
    y_pred = np.array(model.predict(X_test))
    
    
    #find scores for each columns and append to list 
    for i in range(len(column_names)):
        accuracy = accuracy_score(y_test.iloc[:, i], y_pred[:, i])
        precision = precision_score(y_test.iloc[:, i], y_pred[:, i])
        recall = recall_score(y_test.iloc[:, i], y_pred[:, i])
        f1 = f1_score(y_test.iloc[:, i], y_pred[:, i])
        
        score.append([accuracy, precision, recall, f1])
    
    # Create dataframe 
    score = np.array(score)
    df_score = pd.DataFrame(data = score, index = column_names, columns = ['Accuracy', 'Precision', 'Recall', 'F1'])
    
    return print(df_score)

In [35]:
#check prediction result
y_pred = pipeline.predict(X_test)
y_pred= np.array(y_pred)

#get columns names
column_names = Y.columns
Y.columns.values

array(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers',
       'other_infrastructure', 'weather_related', 'floods', 'storm',
       'fire', 'earthquake', 'cold', 'other_weather', 'direct_report'], dtype=object)

In [36]:
#get evaluation metrics results
#accuracy, precision, recall and f1 score for each columns
df_score=evaluation(pipeline,y_test, X_test, column_names)


Precision is ill-defined and being set to 0.0 due to no predicted samples.


F-score is ill-defined and being set to 0.0 due to no predicted samples.



                        Accuracy  Precision    Recall        F1
related                 0.727189   0.769428  0.913608  0.835342
request                 0.821659   0.449339  0.089631  0.149451
offer                   0.995084   0.000000  0.000000  0.000000
aid_related             0.568049   0.445372  0.220015  0.294531
medical_help            0.915207   0.021277  0.001972  0.003610
medical_products        0.946851   0.000000  0.000000  0.000000
search_and_rescue       0.968049   0.000000  0.000000  0.000000
security                0.979877   0.200000  0.016000  0.029630
military                0.969278   0.200000  0.005076  0.009901
child_alone             1.000000   0.000000  0.000000  0.000000
water                   0.934255   0.055556  0.002427  0.004651
food                    0.880799   0.096774  0.008264  0.015228
shelter                 0.904147   0.117647  0.010256  0.018868
clothing                0.985561   0.000000  0.000000  0.000000
money                   0.976344   0.000

### 6. Improve your model
Use grid search to find better parameters. 

In [93]:
from sklearn.model_selection import GridSearchCV
parameters = {'clf__estimator__max_leaf_nodes': [3, 5],'clf__estimator__min_samples_leaf': [8, 10]}
#create grid search
grid_search = GridSearchCV(pipeline, parameters)
grid_search.fit(X_train, y_train)


#find best model's parameters
grid_search.best_params_
best_grid = grid_search.best_estimator_

Source:
#https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

In [94]:
#get parameters
sorted(pipeline.get_params().keys())

['clf',
 'clf__estimator',
 'clf__estimator__bootstrap',
 'clf__estimator__class_weight',
 'clf__estimator__criterion',
 'clf__estimator__max_depth',
 'clf__estimator__max_features',
 'clf__estimator__max_leaf_nodes',
 'clf__estimator__min_impurity_decrease',
 'clf__estimator__min_impurity_split',
 'clf__estimator__min_samples_leaf',
 'clf__estimator__min_samples_split',
 'clf__estimator__min_weight_fraction_leaf',
 'clf__estimator__n_estimators',
 'clf__estimator__n_jobs',
 'clf__estimator__oob_score',
 'clf__estimator__random_state',
 'clf__estimator__verbose',
 'clf__estimator__warm_start',
 'clf__n_jobs',
 'memory',
 'steps',
 'tfidf',
 'tfidf__analyzer',
 'tfidf__binary',
 'tfidf__decode_error',
 'tfidf__dtype',
 'tfidf__encoding',
 'tfidf__input',
 'tfidf__lowercase',
 'tfidf__max_df',
 'tfidf__max_features',
 'tfidf__min_df',
 'tfidf__ngram_range',
 'tfidf__norm',
 'tfidf__preprocessor',
 'tfidf__smooth_idf',
 'tfidf__stop_words',
 'tfidf__strip_accents',
 'tfidf__sublinear_tf',

In [115]:
#details for best model's parameters
best_grid

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [96]:
#score for best model's parameters
best_grid.score(X_test, y_test)

0.19784946236559139

According to result, firstly model's score is 0.15. After grid search  model's score increases 0.20. 

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [100]:
#find classification report for each columns
evaluation(best_grid,y_test, X_test, column_names)


Precision is ill-defined and being set to 0.0 due to no predicted samples.


F-score is ill-defined and being set to 0.0 due to no predicted samples.



                        Accuracy  Precision  Recall        F1
related                 0.757450    0.75745     1.0  0.861988
request                 0.825192    0.00000     0.0  0.000000
offer                   0.995084    0.00000     0.0  0.000000
aid_related             0.590169    0.00000     0.0  0.000000
medical_help            0.922120    0.00000     0.0  0.000000
medical_products        0.949770    0.00000     0.0  0.000000
search_and_rescue       0.969278    0.00000     0.0  0.000000
security                0.980799    0.00000     0.0  0.000000
military                0.969739    0.00000     0.0  0.000000
child_alone             1.000000    0.00000     0.0  0.000000
water                   0.936713    0.00000     0.0  0.000000
food                    0.888479    0.00000     0.0  0.000000
shelter                 0.910138    0.00000     0.0  0.000000
clothing                0.985714    0.00000     0.0  0.000000
money                   0.976498    0.00000     0.0  0.000000
missing_

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [108]:
from sklearn.ensemble import AdaBoostClassifier

#create new pipeline
pipeline_new = Pipeline([  
                    ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
                    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
                    ])

In [109]:
#fitting
pipeline_new.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...mator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))])

In [110]:
#predict result base on X_test
y_pred_test_1= pipeline_new.predict(X_test)

In [111]:
#predict result base on X_train 
y_pred_train_1= pipeline_new.predict(X_train)

In [112]:
#calculate test score 
pipeline_new.score(X_test, y_test)

0.158678955453149

In [113]:
#calculate train score
pipeline_new.score(X_train,y_train)

0.17686517486814482

In [114]:
#find classification report for each columns
evaluation(pipeline_new,y_test, X_test, column_names)


Precision is ill-defined and being set to 0.0 due to no predicted samples.


F-score is ill-defined and being set to 0.0 due to no predicted samples.



                        Accuracy  Precision    Recall        F1
related                 0.753917   0.762250  0.981140  0.857954
request                 0.822581   0.473684  0.134446  0.209446
offer                   0.995084   0.000000  0.000000  0.000000
aid_related             0.581874   0.470779  0.163043  0.242205
medical_help            0.920891   0.100000  0.001972  0.003868
medical_products        0.949462   0.000000  0.000000  0.000000
search_and_rescue       0.968664   0.000000  0.000000  0.000000
security                0.980031   0.222222  0.016000  0.029851
military                0.967896   0.000000  0.000000  0.000000
child_alone             1.000000   0.000000  0.000000  0.000000
water                   0.935945   0.000000  0.000000  0.000000
food                    0.885100   0.000000  0.000000  0.000000
shelter                 0.908756   0.090909  0.001709  0.003356
clothing                0.984332   0.090909  0.010753  0.019231
money                   0.976190   0.250

### 9. Export your model as a pickle file

In [None]:
import pickle
filename = 'clf.sav'
pickle.dump(model, open(filename, 'wb'))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.