# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
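
The three steps above can be tried end to end on a throwaway SQLite file. This is only a minimal sketch: the table name `toy_messages`, its columns, and the assumption that the label columns start at the fifth position are illustrative and not taken from the project's database.

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical toy table: four metadata columns followed by two label columns.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["we need water", "road is blocked", "hello there"],
    "original": ["n/a", "n/a", "n/a"],
    "genre": ["direct", "news", "social"],
    "water": [1, 0, 0],
    "transport": [0, 1, 0],
})

# Write the table to a temporary SQLite file, then read it back by table name.
db_path = os.path.join(tempfile.mkdtemp(), "toy_messages.db")
engine = create_engine(f"sqlite:///{db_path}")
toy.to_sql("toy_messages", engine, index=False, if_exists="replace")
loaded = pd.read_sql_table("toy_messages", engine)

# Features are the raw message text; targets are every column from the fifth on.
X_toy = loaded["message"]
Y_toy = loaded.iloc[:, 4:]
engine.dispose()

print(list(Y_toy.columns))
```

Selecting targets by position, as `iloc[:, 4:]` does, depends on the column order in the stored table, so it is worth checking `Y_toy.columns` before training.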

In [15]:
# import libraries

import pandas as pd
import numpy as np
import pickle
from sqlalchemy import create_engine
import re
import nltk
from nltk.corpus import stopwords
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, f1_score, fbeta_score, classification_report
from sklearn.metrics import precision_recall_fscore_support
from scipy.stats import hmean
from scipy.stats.mstats import gmean

nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger','stopwords'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('Disaster_Combined', engine)
X = df['message']
Y = df.iloc[:,4:]

### 2. Write a tokenization function to process your text data

In [17]:
def tokenize(text):
    # Normalize: lower-case the text and replace punctuation with spaces
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Split the text into word tokens
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    # Drop stopwords and lemmatize the remaining tokens
    clean_tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return clean_tokens
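
To see what the normalization and stopword steps do without downloading any NLTK data, here is a stripped-down sketch. It is not the notebook's function: `toy_tokenize` and `TOY_STOP_WORDS` are hypothetical names, whitespace splitting stands in for `word_tokenize`, and lemmatization is left out.

```python
import re

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "is", "a", "we", "in"}


def toy_tokenize(text):
    # Lower-case, then replace every non-alphanumeric character with a space.
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # Split on whitespace and drop the stopwords.
    return [w for w in text.split() if w not in TOY_STOP_WORDS]


print(toy_tokenize("We need water, the bridge is OUT!"))
# ['need', 'water', 'bridge', 'out']
```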
  

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [18]:
# Creating a pipeline with Classifier
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
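
The shape of what such a pipeline returns is easiest to see on a toy problem. The sketch below assumes six made-up messages and two made-up label columns; `n_estimators=10` and `random_state=0` are arbitrary choices for reproducibility, and the default tokenizer is used instead of the notebook's `tokenize`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical messages with two binary labels each: [water, transport].
toy_messages = [
    "we need clean water",
    "no drinking water here",
    "the road is blocked",
    "bridge closed to traffic",
    "water truck stuck on the road",
    "thank you for the update",
]
toy_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

# One row per input message, one column per target label.
toy_pred = toy_pipeline.predict(["please send water", "the road is closed"])
print(toy_pred.shape)  # (2, 2)
```

`MultiOutputClassifier` fits one independent copy of the wrapped estimator per label column, which is why training time grows with the number of categories.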

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [19]:
#Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y,test_size=0.3)
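
The split above does not fix a seed, so each run produces a different partition and slightly different scores. The sketch below shows how `test_size=0.3` divides ten samples and how a seed makes the split repeatable; `random_state=42` is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# The same seed gives the same partition on every call.
first = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
second = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

X_tr, X_te, y_tr, y_te = first
print(len(X_tr), len(X_te))  # 7 3
```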

In [20]:
#Train pipeline
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
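
The column-by-column approach described above can be sketched as follows. The two label columns and the predictions are made up; `zero_division=0` is an added choice that silences warnings when a class is never predicted.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels (DataFrame) and predictions (2-D array).
y_true_toy = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
y_pred_toy = np.array([[1, 0], [0, 0], [0, 1], [0, 1]])

reports = {}
for i, col in enumerate(y_true_toy.columns):
    # Column i of the prediction array lines up with column i of the labels.
    print(col)
    print(classification_report(y_true_toy[col], y_pred_toy[:, i], zero_division=0))
    reports[col] = classification_report(
        y_true_toy[col], y_pred_toy[:, i], output_dict=True, zero_division=0
    )
```

Passing `output_dict=True` returns the same numbers as a nested dictionary, which is convenient when the per-category scores need to be collected into a table.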

In [21]:
y_pred = pipeline.predict(X_test)

In [22]:
# Compute the weighted f1 score, precision and recall for each category
def get_results(y_test, y_pred):
    rows = []
    for num, cat in enumerate(y_test.columns):
        precision, recall, f1, support = precision_recall_fscore_support(
            y_test[cat], y_pred[:, num], average='weighted')
        rows.append({'Category': cat, 'f1_score': f1, 'precision': precision, 'recall': recall})
    # Index the rows from 1; DataFrame.set_value was removed in pandas 1.0
    results = pd.DataFrame(rows, columns=['Category', 'f1_score', 'precision', 'recall'],
                           index=range(1, len(rows) + 1))
    print('f1_score:', results['f1_score'].mean())
    print('precision:', results['precision'].mean())
    print('recall:', results['recall'].mean())
    return results

In [23]:
results = get_results(y_test, y_pred)
results

f1_score: 0.932339019646
precision: 0.932968289901
recall: 0.944437380801




| | Category | f1_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.784781 | 0.784109 | 0.797203 |
| 2 | request | 0.868682 | 0.876089 | 0.883789 |
| 3 | offer | 0.994219 | 0.992385 | 0.996058 |
| 4 | aid_related | 0.742718 | 0.748956 | 0.748887 |
| 5 | medical_help | 0.894725 | 0.898477 | 0.923458 |
| 6 | medical_products | 0.934409 | 0.940351 | 0.952575 |
| 7 | search_and_rescue | 0.961818 | 0.961417 | 0.973172 |
| 8 | security | 0.970852 | 0.961469 | 0.98042 |
| 9 | military | 0.96118 | 0.961447 | 0.971011 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


In [24]:
# Calculate the proportion of rows in each column that have label == 1
proportion = Y.sum()/len(Y)

In [25]:
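# Display the proportions in the original column order
# (sort_values returns a new Series, so calling it without assignment has no effect)
proportion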

related                   0.773650
request                   0.170659
offer                     0.004501
aid_related               0.414251
medical_help              0.079493
medical_products          0.050084
search_and_rescue         0.027617
security                  0.017966
military                  0.032804
child_alone               0.000000
water                     0.063778
food                      0.111497
shelter                   0.088267
clothing                  0.015449
money                     0.023039
missing_people            0.011367
refugees                  0.033377
death                     0.045545
other_aid                 0.131446
infrastructure_related    0.065037
transport                 0.045812
buildings                 0.050847
electricity               0.020293
tools                     0.006065
hospitals                 0.010795
shops                     0.004577
aid_centers               0.011787
other_infrastructure      0.043904
weather_related     
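
The same proportion calculation, plus a sorted view, can be reproduced on a small frame. The three columns below are made up for illustration; the point is that `sort_values` returns a new Series and leaves the original untouched unless the result is assigned.

```python
import pandas as pd

# Hypothetical label matrix: one row per message, one column per category.
toy_Y = pd.DataFrame({
    "related": [1, 1, 1, 0],
    "water": [1, 0, 0, 0],
    "child_alone": [0, 0, 0, 0],
})

# Share of rows with label == 1 in each column.
toy_proportion = toy_Y.sum() / len(toy_Y)

# Assign the result to keep the sorted order.
sorted_proportion = toy_proportion.sort_values(ascending=True)
print(sorted_proportion)
```

A column whose proportion is exactly zero, like `child_alone` here, has no positive examples, so any classifier trained on it can only ever predict the negative class.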

### 6. Improve your model
Use grid search to find better parameters. 
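
Grid search addresses nested parameters by joining step names with double underscores, for example `clf__estimator__n_estimators` for the forest inside the `MultiOutputClassifier` inside the `clf` step. The sketch below shows this on made-up data; the candidate values `[5, 10]`, `cv=2`, and the default scorer are arbitrary choices to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical messages with two binary labels each.
toy_messages = [
    "we need clean water", "no drinking water here",
    "the road is blocked", "bridge closed to traffic",
    "water truck stuck on the road", "thank you for the update",
    "send water bottles", "traffic jam on the bridge",
]
toy_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Every tunable name is listed by get_params(); nested names use "__".
param_name = "clf__estimator__n_estimators"
print(param_name in toy_pipeline.get_params())  # True

toy_search = GridSearchCV(toy_pipeline, param_grid={param_name: [5, 10]}, cv=2)
toy_search.fit(toy_messages, toy_labels)
print(toy_search.best_params_)
```

With the default `refit=True`, the fitted search object can be used directly for `predict`, since it retrains the best candidate on the full training data.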

In [26]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f70fb143488>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None,

In [27]:
# Defining median F1_score
def performance_metric(y_true, y_pred):
    """Calculate median F1 score for all of the output classifiers
    
    Args:
    y_true: array. Array containing actual labels.
    y_pred: array. Array containing predicted labels.
        
    Returns:
    score: float. Median F1 score for all of the output classifiers
    """
    f1_list = []
    for i in range(np.shape(y_pred)[1]):
        f1 = f1_score(np.array(y_true)[:, i], y_pred[:, i],average='micro')
        f1_list.append(f1)
        
    score = np.median(f1_list)
    return score
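
One property of this metric is worth keeping in mind when reading the scores: for a single binary label column, micro-averaged F1 counts both classes and therefore equals plain accuracy. The sketch below, on made-up labels, compares it with the positive-class F1; `zero_division=0` is an added choice that avoids a warning.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced column: only 2 of 10 samples are positive,
# and the "model" predicts the negative class every time.
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0] * 10

micro_f1 = f1_score(y_true_demo, y_pred_demo, average="micro")
accuracy = accuracy_score(y_true_demo, y_pred_demo)
positive_f1 = f1_score(y_true_demo, y_pred_demo, average="binary", zero_division=0)

print(micro_f1, accuracy, positive_f1)
```

Here the micro-averaged score stays high although no positive sample is found, so on rare categories a score built this way mostly reflects the majority class.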

In [28]:
# Creating grid search

# This was the original parameter grid; it took too long to run,
# so the parameter space was reduced.
#
# parameters = {'vect__min_df': [1, 5],
#               'tfidf__use_idf': [True, False],
#               'clf__estimator__n_estimators': [10, 25],
#               'clf__estimator__min_samples_split': [2, 5, 10],
#               'clf__estimator__max_depth': [5, 10, None]
#              }

parameters = {
    'clf__estimator__n_estimators': [10, 25]
}

scorer = make_scorer(performance_metric)
cv = GridSearchCV(pipeline, param_grid = parameters, scoring = scorer, verbose = 10)

# Fit the grid search; the best parameters are available afterwards in best_params_
tuned_model = cv.fit(X_train, y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__n_estimators=10 .................................
[CV]  clf__estimator__n_estimators=10, score=0.9567598495994769, total=  44.4s
[CV] clf__estimator__n_estimators=10 .................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   55.1s remaining:    0.0s


[CV]  clf__estimator__n_estimators=10, score=0.9593755108713422, total=  45.1s
[CV] clf__estimator__n_estimators=10 .................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.8min remaining:    0.0s


[CV]  clf__estimator__n_estimators=10, score=0.9570050678437142, total=  45.1s
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.8min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9555337583782899, total= 1.6min
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  4.6min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9659964034657511, total= 1.6min
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  6.5min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9581494196501552, total= 1.6min


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  8.3min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  8.3min finished


In [29]:
# Results of grid search
tuned_model.cv_results_



{'mean_fit_time': array([ 39.52815056,  87.41527526]),
 'std_fit_time': array([ 0.3354358 ,  0.62620234]),
 'mean_score_time': array([ 5.3518184 ,  7.86026144]),
 'std_score_time': array([ 0.03889965,  0.07306144]),
 'param_clf__estimator__n_estimators': masked_array(data = [10 25],
              mask = [False False],
        fill_value = ?),
 'params': [{'clf__estimator__n_estimators': 10},
  {'clf__estimator__n_estimators': 25}],
 'split0_test_score': array([ 0.95675985,  0.95553376]),
 'split1_test_score': array([ 0.95937551,  0.9659964 ]),
 'split2_test_score': array([ 0.95700507,  0.95814942]),
 'mean_test_score': array([ 0.95771348,  0.95989319]),
 'std_test_score': array([ 0.00117949,  0.00444577]),
 'rank_test_score': array([2, 1], dtype=int32),
 'split0_train_score': array([ 0.99423737,  0.99877391]),
 'split1_train_score': array([ 0.99341998,  0.99869217]),
 'split2_train_score': array([ 0.99354259,  0.9986513 ]),
 'mean_train_score': array([ 0.99373331,  0.99870579]),
 'std_

In [30]:
# Parameters for best test score
tuned_model.best_params_

{'clf__estimator__n_estimators': 25}

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!
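
Accuracy, precision and recall for each category can be collected in one table. The sketch below uses made-up labels and predictions and reports the positive-class precision and recall, which is an assumption; the notebook's own helper uses weighted averages instead.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels (DataFrame) and predictions (2-D array).
y_true_toy = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
y_pred_toy = np.array([[1, 0], [0, 0], [0, 1], [1, 1]])

rows = []
for i, col in enumerate(y_true_toy.columns):
    truth, pred = y_true_toy[col], y_pred_toy[:, i]
    rows.append({
        "Category": col,
        "accuracy": accuracy_score(truth, pred),
        "precision": precision_score(truth, pred, zero_division=0),
        "recall": recall_score(truth, pred, zero_division=0),
    })

summary = pd.DataFrame(rows).set_index("Category")
print(summary)
```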

In [31]:
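# Test the tuned model
# as_matrix() was removed in pandas 1.0; .values returns the same NumPy arrays.
# cv was already fitted above, so this call repeats the grid search.
cv.fit(X_train.values, y_train.values)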
y_pred = cv.predict(X_test)
results2 = get_results(y_test, y_pred)
results2

  


Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__n_estimators=10 .................................
[CV]  clf__estimator__n_estimators=10, score=0.9538172306686283, total=  44.6s
[CV] clf__estimator__n_estimators=10 .................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   55.3s remaining:    0.0s


[CV]  clf__estimator__n_estimators=10, score=0.9578224619911722, total=  45.0s
[CV] clf__estimator__n_estimators=10 .................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.9min remaining:    0.0s


[CV]  clf__estimator__n_estimators=10, score=0.9580676802354096, total=  45.3s
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.8min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9569233284289684, total= 1.6min
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  4.6min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9632172633643943, total= 1.6min
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  6.5min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9591302926271048, total= 1.6min


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  8.3min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  8.3min finished


f1_score: 0.936203844568
precision: 0.937830101694
recall: 0.947566574839




| | Category | f1_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.78308 | 0.788917 | 0.802543 |
| 2 | request | 0.884946 | 0.886599 | 0.893833 |
| 3 | offer | 0.994282 | 0.992386 | 0.996186 |
| 4 | aid_related | 0.7706 | 0.770293 | 0.771392 |
| 5 | medical_help | 0.899741 | 0.903793 | 0.924984 |
| 6 | medical_products | 0.936271 | 0.945079 | 0.953719 |
| 7 | search_and_rescue | 0.962811 | 0.963113 | 0.973427 |
| 8 | security | 0.970916 | 0.961472 | 0.980547 |
| 9 | military | 0.961946 | 0.963535 | 0.971647 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


In [32]:
# Get summary stats for the tuned model
results2.sum()

Category     relatedrequestofferaid_relatedmedical_helpmedi...
f1_score                                               33.7033
precision                                              33.7619
recall                                                 34.1124
dtype: object

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
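
One way to act on the second idea is to combine TF-IDF with a hand-written feature through `FeatureUnion`. The sketch below is an assumption about how that could look: `TextLengthExtractor` is a hypothetical transformer that adds the word count of each message, and `TfidfVectorizer` is used as a shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of whitespace-separated words."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer works in a pipeline.
        return self

    def transform(self, X):
        # Return a 2-D column so it can be stacked next to the TF-IDF matrix.
        return np.array([[len(text.split())] for text in X], dtype=float)


features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", TextLengthExtractor()),
])

docs = ["we need water", "the road is blocked near the bridge"]
matrix = features.fit_transform(docs)

# The last column holds the word counts; the others are TF-IDF weights.
print(matrix.toarray()[:, -1])
```

The resulting `features` object could replace the `vect` and `tfidf` steps of a pipeline, with the classifier step left unchanged.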

In [33]:
# Creating a pipeline with Classifier
pipeline_enhance = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf',MultiOutputClassifier(AdaBoostClassifier()))
])

In [35]:
#Defining new parameters

parameters_enhance = {
                      'clf__estimator__n_estimators':[10, 25]
                     }

scorer_enhance = make_scorer(performance_metric)
cv_enhance = GridSearchCV(pipeline_enhance, param_grid = parameters_enhance, scoring = scorer_enhance, verbose = 10)

# Fit the grid search; the best parameters are available afterwards in best_params_
tuned_model_enhance = cv_enhance.fit(X_train, y_train)


Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__n_estimators=10 .................................
[CV]  clf__estimator__n_estimators=10, score=0.9608468203367664, total=  22.5s
[CV] clf__estimator__n_estimators=10 .................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   30.5s remaining:    0.0s


[CV]  clf__estimator__n_estimators=10, score=0.9651790093182933, total=  22.5s
[CV] clf__estimator__n_estimators=10 .................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.0min remaining:    0.0s


[CV]  clf__estimator__n_estimators=10, score=0.9632172633643943, total=  22.5s
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.5min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9629720451201569, total=  39.3s
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  2.3min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9650155304888017, total=  39.9s
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  3.2min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9645250940003269, total=  39.5s


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  4.0min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  4.0min finished


In [37]:
# Results of grid search
tuned_model_enhance.cv_results_



{'mean_fit_time': array([ 18.46024323,  35.00510653]),
 'std_fit_time': array([ 0.04831694,  0.29728271]),
 'mean_score_time': array([ 4.0346458,  4.5659682]),
 'std_score_time': array([ 0.02671442,  0.0450056 ]),
 'param_clf__estimator__n_estimators': masked_array(data = [10 25],
              mask = [False False],
        fill_value = ?),
 'params': [{'clf__estimator__n_estimators': 10},
  {'clf__estimator__n_estimators': 25}],
 'split0_test_score': array([ 0.96084682,  0.96297205]),
 'split1_test_score': array([ 0.96517901,  0.96501553]),
 'split2_test_score': array([ 0.96321726,  0.96452509]),
 'mean_test_score': array([ 0.96308103,  0.96417089]),
 'std_test_score': array([ 0.00177123,  0.00087104]),
 'rank_test_score': array([2, 1], dtype=int32),
 'split0_train_score': array([ 0.96403466,  0.9659964 ]),
 'split1_train_score': array([ 0.96338074,  0.96427988]),
 'split2_train_score': array([ 0.96611901,  0.96656858]),
 'mean_train_score': array([ 0.96451147,  0.96561495]),
 'std_tr

In [38]:
# Parameters for best test score
tuned_model_enhance.best_params_

{'clf__estimator__n_estimators': 25}

In [39]:
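# Test the model
# Caution: this cell refits and evaluates `cv` (the random forest grid search),
# not `cv_enhance`, so the results below describe the random forest model.
# To evaluate the AdaBoost model, use cv_enhance.predict(X_test) instead.
cv.fit(X_train.values, y_train.values)
y_pred = cv.predict(X_test)
results2 = get_results(y_test, y_pred)
results2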

  


Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__n_estimators=10 .................................
[CV]  clf__estimator__n_estimators=10, score=0.9545528854013405, total=  44.7s
[CV] clf__estimator__n_estimators=10 .................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   55.3s remaining:    0.0s


[CV]  clf__estimator__n_estimators=10, score=0.9591302926271048, total=  45.1s
[CV] clf__estimator__n_estimators=10 .................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.9min remaining:    0.0s


[CV]  clf__estimator__n_estimators=10, score=0.9584763773091385, total=  45.3s
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.8min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.957495504332189, total= 1.6min
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  4.6min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9627268268759196, total= 1.6min
[CV] clf__estimator__n_estimators=25 .................................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:  6.5min remaining:    0.0s


[CV]  clf__estimator__n_estimators=25, score=0.9579859408206637, total= 1.6min


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  8.3min remaining:    0.0s
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:  8.3min finished


f1_score: 0.936118728607
precision: 0.936869264405
recall: 0.947379388288




| | Category | f1_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.786343 | 0.791796 | 0.802924 |
| 2 | request | 0.881782 | 0.885191 | 0.892308 |
| 3 | offer | 0.994282 | 0.992386 | 0.996186 |
| 4 | aid_related | 0.768112 | 0.768313 | 0.769866 |
| 5 | medical_help | 0.896884 | 0.902719 | 0.924476 |
| 6 | medical_products | 0.935083 | 0.944846 | 0.953338 |
| 7 | search_and_rescue | 0.962811 | 0.963113 | 0.973427 |
| 8 | security | 0.970852 | 0.961469 | 0.98042 |
| 9 | military | 0.962874 | 0.964514 | 0.972028 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |


### 9. Export your model as a pickle file

In [40]:
# Get summary stats for the model tested above
results2.sum()

Category     relatedrequestofferaid_relatedmedical_helpmedi...
f1_score                                               33.7003
precision                                              33.7273
recall                                                 34.1057
dtype: object

In [41]:
# Save the tuned model; the with-block closes the file after writing
with open('disaster_model.sav', 'wb') as model_file:
    pickle.dump(tuned_model, model_file)
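
A quick way to check an exported model is to load it back and compare predictions. The sketch below does this with a small stand-in pipeline on made-up data; the file name `toy_model.pkl` and the training messages are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the tuned model.
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=5, random_state=0)),
])
toy_model.fit(
    ["we need water", "send water now", "road is blocked", "bridge is closed"],
    [1, 1, 0, 0],
)

# Write the model to disk, then load it back from the same path.
model_path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(model_path, "wb") as model_file:
    pickle.dump(toy_model, model_file)
with open(model_path, "rb") as model_file:
    restored_model = pickle.load(model_file)

sample = ["need water now"]
print(restored_model.predict(sample), toy_model.predict(sample))
```

Pickle stores functions by reference, so a pipeline that uses a custom tokenizer such as `tokenize` can only be loaded where that function is importable under the same module path.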

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
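
The command-line part of such a script might look like the sketch below. The template file itself is not shown here, so the argument names `database_filepath` and `model_filepath` and the function names are assumptions; the training steps are left as placeholders.

```python
import argparse


def build_parser():
    # Two positional arguments: where to read the data and where to save the model.
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="path for the exported pickle file")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholders: load data, build the pipeline, fit, evaluate, save.
    print(f"Loading data from {args.database_filepath}")
    print(f"Saving model to {args.model_filepath}")
    return args


# Example call with explicit arguments instead of the real command line.
demo_args = main(["toy.db", "toy_model.pkl"])
```

In a real script, `main()` would be called without arguments under an `if __name__ == "__main__":` guard so that the paths come from the command line.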