# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
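
As a minimal sketch of the loading step, the snippet below writes a tiny made-up table to an in-memory SQLite database and reads it back with `read_sql_table`. The toy rows and the `id`, `original`, and `genre` column names are assumptions for illustration; only `message` and the "targets start at the fifth column" layout come from this notebook.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database standing in for the real .db file (assumption).
engine = create_engine("sqlite://")

# Hypothetical toy table: four leading columns, then one column per category.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["we need water", "road is flooded", "hello there"],
    "original": ["we need water", "road is flooded", "hello there"],
    "genre": ["direct", "news", "social"],
    "water": [1, 0, 0],
    "floods": [0, 1, 0],
})
toy.to_sql("DisasterTable", engine, index=False)

# read_sql_table loads a whole table by name, so no SQL string is needed.
df = pd.read_sql_table("DisasterTable", engine)

X = df["message"]      # feature: raw message text
Y = df.iloc[:, 4:]     # targets: every column after the first four

print(X.shape, list(Y.columns))
```

For a database on disk, the same call should work with an engine created from a `sqlite:///path/to/file.db` URL.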

In [14]:
from google.colab import drive
drive.mount('/content/drive')
main = '/content/drive/My Drive/Colab Notebooks/'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [26]:
# data management
import pandas as pd
from sqlalchemy import create_engine
# nlp
import re
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
# ML essentials
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
# Text engineering - TruncatedSVD for LSA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
# Classifiers
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
# Gridsearch
from sklearn.model_selection import GridSearchCV
# Evaluation metrics
from sklearn.metrics import classification_report, hamming_loss, precision_recall_fscore_support
from sklearn.metrics import accuracy_score, make_scorer, f1_score

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 2. Write a tokenization function to process your text data
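
To make the individual steps easier to see, here is a dependency-free sketch of the normalize, split, and filter stages. It is not the notebook's tokenizer: `simple_tokenize` and the tiny `STOP_WORDS` set are hypothetical stand-ins, whitespace splitting replaces `word_tokenize`, and lemmatization is left out.

```python
import re

# Tiny stand-in stop-word set (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "in", "a", "we", "are", "of"}

def simple_tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    # Keep only lowercase letters, digits and whitespace.
    # The raw string avoids invalid-escape warnings for \s.
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = simple_tokenize("We are in URGENT need of water, food & shelter!")
print(tokens)  # ['urgent', 'need', 'water', 'food', 'shelter']
```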

In [0]:
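# load data from database
def load_data(db_path, table_name):
    """Read `table_name` from the database at `db_path` and split it into X and y."""
    engine = create_engine(db_path)
    df = pd.read_sql(f"SELECT * FROM {table_name}", engine)

    # feature: the raw message text
    X = df['message']
    # targets: every column from the fifth onward is a category label
    y = df.iloc[:, 4:]
    return X, y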

def tokenize(text):
    """Normalize, tokenize, remove stop words from, and lemmatize a message."""
    lemmatizer = WordNetLemmatizer()
    # a set makes the membership test in the filter below fast
    stop_words = set(stopwords.words('english'))

    # normalize - remove punctuation & convert to lowercase
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower())
    # tokenize
    tokens = word_tokenize(text)
    # stopwords filter
    tokens = [word.strip() for word in tokens if word not in stop_words]
    # lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return tokens

In [0]:
X, y = load_data(f"sqlite:///{main}distab.db", "DisasterTable")

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
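
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on made-up numeric data with two binary targets. The toy arrays and the choice of `LogisticRegression` as the base estimator are assumptions for brevity; the notebook itself wraps an `MLPClassifier`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy numeric features and two binary target columns (made-up data).
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 0.9], [0.9, 0.1]])
Y_toy = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

# One independent copy of the base estimator is fitted per target column.
clf = MultiOutputClassifier(LogisticRegression())
clf.fit(X_toy, Y_toy)

pred_toy = clf.predict(X_toy)
print(pred_toy.shape)         # one prediction column per target
print(len(clf.estimators_))   # one fitted estimator per target
```

Because each target gets its own model, the wrapper does not share information between categories; that is a design trade-off rather than a requirement of the task.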

In [0]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(tokenizer=tokenize)),
    ('lsa', TruncatedSVD(n_components=100, random_state=42)),
    #('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100)))
    ('clf', MultiOutputClassifier(MLPClassifier(learning_rate='constant', learning_rate_init=0.001,
                                                max_iter=300, early_stopping=True, n_iter_no_change=5,
                                                random_state=42, warm_start=True, verbose=1)))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
pipeline.fit(X_train, y_train)

Iteration 1, loss = 0.60398018
Validation score: 0.767666
Iteration 2, loss = 0.49318912
Validation score: 0.780376
Iteration 3, loss = 0.44728439
Validation score: 0.800203
Iteration 4, loss = 0.42158255
Validation score: 0.807829
Iteration 5, loss = 0.40967015
Validation score: 0.809863
Iteration 6, loss = 0.40277528
Validation score: 0.818505
Iteration 7, loss = 0.39860278
Validation score: 0.819522
Iteration 8, loss = 0.39516466
Validation score: 0.821556
Iteration 9, loss = 0.39190603
Validation score: 0.819522
Iteration 10, loss = 0.38992964
Validation score: 0.822064
Iteration 11, loss = 0.38819308
Validation score: 0.820539
Iteration 12, loss = 0.38692425
Validation score: 0.823081
Iteration 13, loss = 0.38558848
Validation score: 0.824098
Iteration 14, loss = 0.38449275
Validation score: 0.822064
Iteration 15, loss = 0.38351266
Validation score: 0.822572
Iteration 16, loss = 0.38248426
Validation score: 0.827656
Iteration 17, loss = 0.38168543
Validation score: 0.827148
[output truncated]

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(...
                                                               beta_2=0.999,
                                                               early_stopping=True,
                                                               epsilon=1e-08,
                                                               

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
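
When reading multilabel metrics it helps to know that `accuracy_score` counts a sample as correct only if every label matches, while `hamming_loss` counts individual label errors. The sketch below shows the difference on a made-up 4x3 label matrix; the arrays are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, hamming_loss,
                             precision_recall_fscore_support)

# Made-up ground truth and predictions: 4 samples, 3 labels.
y_true_toy = np.array([[1, 0, 1],
                       [1, 1, 0],
                       [0, 0, 1],
                       [1, 0, 0]])
y_pred_toy = np.array([[1, 0, 1],
                       [1, 0, 0],
                       [0, 0, 1],
                       [1, 0, 1]])

# Exact-match accuracy: only rows 0 and 2 are fully correct.
subset_acc = accuracy_score(y_true_toy, y_pred_toy)
# Hamming loss: 2 wrong label assignments out of 12.
ham = hamming_loss(y_true_toy, y_pred_toy)

# average=None returns one value per label; zero_division=0 silences the
# undefined-metric warning for labels that are never predicted.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true_toy, y_pred_toy, average=None, zero_division=0)

print(subset_acc, round(ham, 4))
print(support)
```

This is why a model can show a low accuracy next to a small Hamming loss: with many labels per sample, a single wrong label makes the whole row count as a miss.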

In [0]:
pred = pipeline.predict(X_test)

In [0]:
def evaluate(y_true, y_pred):
    """Print global multilabel metrics and return per-label scores as a DataFrame."""
    # with the default average=None, each returned array has one entry per label
    precision, recall, fscore, support = precision_recall_fscore_support(y_true, y_pred)

    score_df = pd.DataFrame(
        {'Precision': precision, 'Recall': recall, 'F-Score': fscore, 'Pos_labels': support},
        index=y_true.columns.values)
    score_df = score_df.sort_values(by='F-Score', ascending=False)

    # for multilabel targets, accuracy_score is the exact-match (subset) accuracy
    acc = accuracy_score(y_true, y_pred)
    # hamming_loss is the fraction of individual label assignments that are wrong
    loss = hamming_loss(y_true, y_pred)
    print("=====Global Metrics=====\n")
    print("Accuracy: {:.4f}".format(acc))
    print("Hamming Loss: {:.4f}\n".format(loss))
    print("=====Label Metrics=====\n")
    return score_df

In [22]:
evaluate(y_test, pred)

=====Global Metrics=====

Accuracy: 0.2782
Hamming Loss: 0.0534

=====Label Metrics=====



  _warn_prf(average, modifier, msg_start, len(result))


| Category | Precision | Recall | F-Score | Pos_labels |
|---|---|---|---|---|
| related | 0.839821 | 0.936213 | 0.885401 | 5001 |
| earthquake | 0.883495 | 0.722222 | 0.79476 | 630 |
| food | 0.791726 | 0.767635 | 0.779494 | 723 |
| weather_related | 0.795825 | 0.671806 | 0.728576 | 1816 |
| water | 0.742382 | 0.656863 | 0.697009 | 408 |
| aid_related | 0.736129 | 0.643704 | 0.686821 | 2700 |
| request | 0.79871 | 0.566331 | 0.662741 | 1093 |
| shelter | 0.744578 | 0.523729 | 0.614925 | 590 |
| storm | 0.70354 | 0.533557 | 0.60687 | 596 |
| direct_report | 0.714634 | 0.463241 | 0.56211 | 1265 |


In [23]:
for i, col in enumerate(y_test.columns):
    print(col)
    print(classification_report(y_test[col], pred[:,i]))

  _warn_prf(average, modifier, msg_start, len(result))


related
              precision    recall  f1-score   support

           0       0.67      0.42      0.52      1553
           1       0.84      0.94      0.89      5001

    accuracy                           0.82      6554
   macro avg       0.76      0.68      0.70      6554
weighted avg       0.80      0.82      0.80      6554

request
              precision    recall  f1-score   support

           0       0.92      0.97      0.94      5461
           1       0.80      0.57      0.66      1093

    accuracy                           0.90      6554
   macro avg       0.86      0.77      0.80      6554
weighted avg       0.90      0.90      0.90      6554

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6522
           1       0.00      0.00      0.00        32

    accuracy                           1.00      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg       0.99      1.00      0.99      655

### 6. Improve your model
Use grid search to find better parameters. 
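
Grid search tries every combination of the listed values, so the cost grows quickly. As a sketch, `ParameterGrid` can count the candidates before anything is fitted. The grid is copied from the cell below: keys follow the `step__parameter` convention, and `clf__estimator__` reaches through the `MultiOutputClassifier` wrapper to the classifier inside it. The names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the notebook cell; step names come from the Pipeline definition.
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'vect__max_df': [0.7, 1.0],
              'vect__max_features': [None, 5000],
              'lsa__n_components': [100, 200, 300],
              'clf__estimator__learning_rate_init': [0.001, 0.0001],
              'clf__estimator__max_iter': [300],
              'clf__estimator__n_iter_no_change': [5],
              'clf__estimator__warm_start': [True],
              'clf__estimator__early_stopping': [True],
              'clf__estimator__random_state': [42]}

# 2 * 2 * 2 * 3 * 2 combinations; single-value entries do not add candidates.
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * 5   # cv=5 folds per candidate

print(n_candidates, n_fits)  # 48 240
```

Trimming one two-valued entry halves the number of fits, which matters when a full run takes hours.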

In [0]:
parameters = {'vect__ngram_range': [(1,1), (1,2)],
              'vect__max_df': [0.7, 1.0],
              'vect__max_features': [None, 5000],
              'lsa__n_components': [100, 200, 300],
              'clf__estimator__learning_rate_init': [0.001, 0.0001],
              'clf__estimator__max_iter': [300],
              'clf__estimator__n_iter_no_change': [5],
              'clf__estimator__warm_start': [True],
              'clf__estimator__early_stopping': [True],
              'clf__estimator__random_state': [42]}

cv = GridSearchCV(pipeline, param_grid=parameters, scoring=make_scorer(f1_score, average='macro'), n_jobs=4, cv=5, verbose=10)

In [25]:
cv.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:  5.1min
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  8.8min
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed: 14.0min
[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed: 17.4min
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed: 25.4min
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 32.5min
[Parallel(n_jobs=4)]: Done  53 tasks      | elapsed: 44.7min
[Parallel(n_jobs=4)]: Done  64 tasks      | elapsed: 52.9min
[Parallel(n_jobs=4)]: Done  77 tasks      | elapsed: 68.1min
[Parallel(n_jobs=4)]: Done  90 tasks      | elapsed: 83.7min
[Parallel(n_jobs=4)]: Done 105 tasks      | elapsed: 103.0min
[Parallel(n_jobs=4)]: Done 120 tasks      | elapsed: 121.6min
[Parallel(n_jobs=4)]: Done 137 tasks      | elapsed: 129.9min
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed: 138.6min
[Parallel(n_jobs=4)]: Done 173 tasks      | elapsed: 149.8min
[output truncated]

Iteration 1, loss = 0.56885467
Validation score: 0.767666
Iteration 2, loss = 0.47156335
Validation score: 0.791561
Iteration 3, loss = 0.41244352
Validation score: 0.820539
Iteration 4, loss = 0.38812099
Validation score: 0.826131
Iteration 5, loss = 0.37811065
Validation score: 0.828673
Iteration 6, loss = 0.37225984
Validation score: 0.831723
Iteration 7, loss = 0.36783790
Validation score: 0.831723
Iteration 8, loss = 0.36415502
Validation score: 0.830198
Iteration 9, loss = 0.36116132
Validation score: 0.832740
Iteration 10, loss = 0.35825997
Validation score: 0.831723
Iteration 11, loss = 0.35612590
Validation score: 0.835791
Iteration 12, loss = 0.35374835
Validation score: 0.834774
Iteration 13, loss = 0.35166136
Validation score: 0.836299
Iteration 14, loss = 0.34931557
Validation score: 0.835282
Iteration 15, loss = 0.34724573
Validation score: 0.840874
Iteration 16, loss = 0.34525980
Validation score: 0.835791
Iteration 17, loss = 0.34328692
Validation score: 0.838841
[output truncated]

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
[output truncated]

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [0]:
tuned_pipe = cv.best_estimator_

In [41]:
tuned_pipe

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.7, max_features=5000,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(...
                                                               beta_2=0.999,
                                                               early_stopping=True,
                                                               epsilon=1e-08,
                                                               

In [0]:
predicted_labels = tuned_pipe.predict(X_test)

evaluate(y_test, predicted_labels)

In [0]:
def predict_new_data(model, labels, docs):
    """Predict categories for each document and print the positive labels."""
    prediction = model.predict(docs)

    tagged = []

    # each row of `prediction` holds one 0/1 flag per label, in `labels` order
    for doc, row in zip(docs, prediction):
        tags = [label for label, flag in zip(labels, row) if flag == 1]
        tagged.append(tags)
        print("\n", doc)
        print(tags)

    #return tagged


In [77]:
test_docs = ["Fires out of control in Australian badlands",
             "Maple Leafs crush Canadiens in amazing 7-0 lockout",
             "Floods Fatal in Alicante. Roads Washed Out",
             "Hurricane Sandy Touches Down in New York. Critical Damage."]

predict_new_data(tuned_pipe, y_test.columns, test_docs)



 Fires out of control in Australian badlands
['related', 'weather_related', 'fire']

 Maple Leafs crush Canadiens in amazing 7-0 lockout
['related']

 Floods Fatal in Alicante. Roads Washed Out
['related', 'transport', 'weather_related', 'floods']

 Hurricane Sandy Touches Down in New York. Critical Damage.
['related', 'weather_related', 'storm']


### 9. Export your model as a pickle file

In [0]:
import pickle
with open(f'{main}mlp_class.pkl', 'wb') as out:
    pickle.dump(tuned_pipe, out)
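
A quick way to check an export is to load the file back and compare predictions. The sketch below does this round trip with a small stand-in pipeline trained on made-up documents; the toy data, the `LogisticRegression` classifier, and the temporary file path are assumptions, not the notebook's model.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up training data for a tiny stand-in pipeline.
docs = ["need water", "need food", "storm coming", "big storm"]
labels = [0, 0, 1, 1]

toy_pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", LogisticRegression())])
toy_pipe.fit(docs, labels)

# Write the fitted pipeline to a temporary file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as out:
    pickle.dump(toy_pipe, out)
with open(path, "rb") as src:
    restored = pickle.load(src)

# The restored pipeline should reproduce the original predictions.
same = bool((restored.predict(docs) == toy_pipe.predict(docs)).all())
print(same)
```

One caveat for the real model: pickle stores a custom function such as `tokenize` by reference, so the code that loads the file needs that function importable under the same module name.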

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
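
The template itself is not shown here, so the following is only a hypothetical skeleton for such a script. The argument names (`database_filepath`, `model_filepath`, `--table`) and the step list are assumptions; each step would be replaced by the corresponding notebook code.

```python
import argparse

def parse_args(argv=None):
    """Parse the (hypothetical) command-line arguments for train.py."""
    parser = argparse.ArgumentParser(description="Train and export a message classifier.")
    parser.add_argument("database_filepath", help="path to the SQLite database file")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    parser.add_argument("--table", default="DisasterTable", help="table to read")
    return parser.parse_args(argv)

def main(argv=None):
    """Return the planned steps; a real script would run them instead."""
    args = parse_args(argv)
    return [
        f"load {args.table} from sqlite:///{args.database_filepath}",
        "build pipeline",
        "fit pipeline (optionally with grid search)",
        "evaluate on held-out data",
        f"pickle model to {args.model_filepath}",
    ]

# Example invocation with explicit arguments instead of sys.argv.
steps = main(["data.db", "model.pkl"])
for step in steps:
    print(step)
```

Passing `argv` explicitly keeps the functions easy to test; a real script would typically call `main()` with no arguments under an `if __name__ == "__main__":` guard.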