# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

# Phase 4: Modeling

In [None]:
# Run `pip install scikit-learn==0.22.2` in a terminal first so that scikit-learn 0.22 is available (the `zero_division` argument used below needs 0.22 or newer)

In [1]:
# import libraries and download NLTK
import pickle
import nltk
import torch
import sklearn
nltk.download(['wordnet','punkt','stopwords'])
import pandas as pd
import numpy as np
import scipy as sp
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.utils.multiclass import type_of_target
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# check which scikit-learn version is running (0.22 is expected)
sklearn_version = sklearn.__version__
print(sklearn_version)

0.19.1
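
The version printed above (0.19.1) is older than the 0.22 release that the install note asks for, and `classification_report(..., zero_division=0)` used later needs 0.22 or newer. A minimal sketch of an explicit guard, assuming that comparing only the major and minor numbers is enough; `version_at_least` is a hypothetical helper, not part of scikit-learn:

```python
import re


def version_at_least(version, minimum):
    """Return True if a dotted version string is >= the (major, minor) tuple."""
    # keep only the leading major and minor numbers, e.g. '0.22.2' -> (0, 22)
    parts = re.findall(r'\d+', version)[:2]
    return tuple(int(p) for p in parts) >= tuple(minimum)


# the version printed in this notebook vs. the one requested in the install note
print(version_at_least('0.19.1', (0, 22)))  # False
print(version_at_least('0.22.2', (0, 22)))  # True
```

In the notebook this could be called as `version_at_least(sklearn.__version__, (0, 22))` before running the evaluation cells.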


In [26]:
# Function for ETL
def ETL_Data():
    engine = create_engine('sqlite:///DRP.db')
    df = pd.read_sql_table('etl_pipeline', con=engine)
    # classification ML model:
    # X: raw messages (kept as a pandas Series; .values removed 11/14/21)
    X = df.message
    # y: classification categories (every column except the ones dropped here)
    categories = df.drop(['id', 'message', 'original', 'genre'], axis=1)
    y = categories.values
    # ycol: category names, used later as target_names in the reports
    ycol = list(categories.columns)
    return X, y, ycol

### 2. Write a tokenization function to process your text data

In [5]:
#tokenize with and without stop words
def tokenize_w_stopw(text):
    '''function to tokenize a message and return it with and without stop words

    - tokenize words
    - lemmatize words
    - lowercase words
    - remove whitespace
    - append every clean token to clean_tokens
    - append only non-stop-word tokens to clean_tok_nostop
    '''
    # split the message into word tokens
    tokens = word_tokenize(text)
    # instantiate lemmatizer (converts words to their base forms)
    lemmatizer = WordNetLemmatizer()
    # build the stop word set once instead of reloading the list for every token
    stop_words = set(stopwords.words("english"))

    clean_tokens = []
    clean_tok_nostop = []
    for tok in tokens:
        # lemmatize, lowercase, and strip leading/trailing whitespace
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
        # the check uses the raw (not lowercased) token,
        # so capitalized stop words such as 'Is' are not removed
        if tok not in stop_words:
            clean_tok_nostop.append(clean_tok)

    return clean_tokens, clean_tok_nostop

In [6]:
#check stopwords in English
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [7]:
# test tokenize function on first 10 messages from db, compare with and without stop words
# stop word removal appears to strip too much meaning from the messages:
# - row 2: the word 'over' was removed, but that word carries meaning
# - row 3: 'Looking for someone but no name' changes meaning without stop words
# For this ML exercise stop words will NOT be removed
X, y, ycol = ETL_Data()
for message in X[0:10]:
    # tokens[0] keeps stop words; tokens[1] has stop words removed
    tokens = tokenize_w_stopw(message)
    print('UNCLEAN MSG: ', message)
    print('\n')
    print('TOKENIZED TXT: ', tokens[0], '\n')
    print('TOKENIZED TXT with STOPWORDS: ', tokens[1], '\n')
    print('\n','-----------------------------------------------------------------------------')

UNCLEAN MSG:  Weather update - a cold front from Cuba that could pass over Haiti


TOKENIZED TXT:  ['weather', 'update', '-', 'a', 'cold', 'front', 'from', 'cuba', 'that', 'could', 'pas', 'over', 'haiti'] 

TOKENIZED TXT with STOPWORDS:  ['weather', 'update', '-', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti'] 


 -----------------------------------------------------------------------------
UNCLEAN MSG:  Is the Hurricane over or is it not over


TOKENIZED TXT:  ['is', 'the', 'hurricane', 'over', 'or', 'is', 'it', 'not', 'over'] 

TOKENIZED TXT with STOPWORDS:  ['is', 'hurricane'] 


 -----------------------------------------------------------------------------
UNCLEAN MSG:  Looking for someone but no name


TOKENIZED TXT:  ['looking', 'for', 'someone', 'but', 'no', 'name'] 

TOKENIZED TXT with STOPWORDS:  ['looking', 'someone', 'name'] 


 -----------------------------------------------------------------------------
UNCLEAN MSG:  UN reports Leogane 80-90 destroyed. Only Hospital St.
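
In the second message above, 'is' survives the stop word filter because the check runs on the raw token 'Is', while the NLTK list is all lowercase. A minimal sketch of the difference, using a small hand-picked stop word set and `str.split` in place of NLTK (both are simplifying assumptions so the example runs without any downloads):

```python
# tiny stand-in for stopwords.words("english"); the real list is much longer
STOP_WORDS = {'is', 'the', 'over', 'or', 'it', 'not'}


def drop_stop_words(text, lowercase_first):
    """Remove stop words, optionally lowercasing each token before the check."""
    kept = []
    for tok in text.split():
        candidate = tok.lower() if lowercase_first else tok
        if candidate not in STOP_WORDS:
            kept.append(tok.lower())
    return kept


msg = 'Is the Hurricane over or is it not over'
print(drop_stop_words(msg, lowercase_first=False))  # ['is', 'hurricane']
print(drop_stop_words(msg, lowercase_first=True))   # ['hurricane']
```

Since the project keeps stop words anyway, this only affects the comparison cell, not the final `tokenize` function.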

In [29]:
#tokenize function that keeps stop words (used for project below)
def tokenize(text):
    '''function to tokenize messages without removing stop words
    - tokenize words
    - lemmatize words
    - lowercase words
    - remove whitespace
    - append clean tokens
    '''
    # split the message into word tokens
    tokens = word_tokenize(text)
    # instantiate lemmatizer (converts words to their base forms)
    lemmatizer = WordNetLemmatizer()

    # lemmatize each token and collect the results in clean_tokens
    clean_tokens = []
    for tok in tokens:
        # lemmatize, lowercase, and strip leading/trailing whitespace
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [30]:
#ML Pipeline. Chose multinomial Naive Bayes as 1st choice after reviewing the scikit algorithm cheat sheet:
#https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
def nbpipe(X_train, X_test, y_train, y_test):
    '''function to:
    - pipeline using countvectorizer and tfidf transformers; classify with multinomial naive bayes
    - fit pipeline with training data
    - predict on test data
    '''
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)), 
        ('tfidf', TfidfTransformer()), 
        ('clf', MultiOutputClassifier(MultinomialNB()))
    ])
    
    
    # fit/train transformers and classifier
    pipeline.fit(X_train, y_train)
    # predict on test data
    y_pred = pipeline.predict(X_test)
    
    return pipeline, y_pred

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [31]:
#split data to train and test
X, y, ycol = ETL_Data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [32]:
#train pipeline
pipeline, y_pred = nbpipe(X_train, X_test, y_train, y_test)

# Phase 5: Evaluation (and some re-modeling to improve scores)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [33]:
#Multinomial NB score prior to gridsearch
target_names = ycol
print(classification_report(y_test, y_pred, target_names=target_names, zero_division=0))
#Precision: what percent of the positive predictions were correct? Precision = TP/(TP + FP)
#Recall: what percent of the actual positive cases were caught? Recall = TP/(TP + FN)
#F1 score: harmonic mean of precision and recall. F1 Score = 2*(Recall * Precision) / (Recall + Precision)
#Support is the number of actual occurrences of the class in the specified dataset

                        precision    recall  f1-score   support

               related       0.77      1.00      0.87      6617
               request       0.91      0.13      0.22      1450
                 offer       0.00      0.00      0.00        41
           aid_related       0.77      0.58      0.66      3607
          medical_help       0.00      0.00      0.00       694
      medical_products       1.00      0.00      0.00       441
     search_and_rescue       0.00      0.00      0.00       242
              security       0.00      0.00      0.00       170
              military       0.00      0.00      0.00       265
           child_alone       0.00      0.00      0.00         0
                 water       0.00      0.00      0.00       542
                  food       0.67      0.00      0.01       950
               shelter       0.00      0.00      0.00       785
              clothing       0.00      0.00      0.00       123
                 money       0.00      
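
The instructions for this step suggest iterating through the columns and calling `classification_report` on each, whereas the cell above makes a single multilabel call. A sketch of the per-column variant on a tiny made-up label matrix (the arrays and the two column names are illustrative, not taken from the dataset; `zero_division` assumes scikit-learn 0.22 or newer):

```python
import numpy as np
from sklearn.metrics import classification_report

# made-up labels: 4 messages x 2 categories
columns = ['related', 'request']
y_true = np.array([[1, 0], [1, 1], [0, 0], [1, 1]])
y_hat = np.array([[1, 0], [1, 0], [0, 0], [1, 1]])

reports = {}
for i, name in enumerate(columns):
    # one binary report per category column, kept as a dict for later lookup
    reports[name] = classification_report(
        y_true[:, i], y_hat[:, i], zero_division=0, output_dict=True)
    print(name)
    print(classification_report(y_true[:, i], y_hat[:, i], zero_division=0))
```

In the notebook the same loop would run over `ycol`, `y_test[:, i]`, and `y_pred[:, i]`.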

### 6. Improve your model
Use grid search to find better parameters. 

In [34]:
#rerun pipeline without fitting (instantiated only)
def nbpipe_improve(X_train, X_test, y_train, y_test):
    '''function to:
    - pipeline using countvectorizer and tfidf transformers; classify with multinomial naive bayes
    '''
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)), 
        ('tfidf', TfidfTransformer()), 
        ('clf', MultiOutputClassifier(MultinomialNB()))
    ])
    
    
    return pipeline

#instantiate the unfitted pipeline (GridSearchCV does the fitting below)
pipeline = nbpipe_improve(X_train, X_test, y_train, y_test)

In [35]:
#parameters for gridsearch + model fitting; then print best parameters from analysis
parameters = {
    # vectorizer and tf-idf settings searched for the naive bayes pipeline
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False)
}

cv = GridSearchCV(pipeline, param_grid=parameters, verbose=3)

cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

print(cv.best_params_)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV] tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1) 
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1), score=0.172, total=   9.2s
[CV] tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=None)]: Done   1 out of   1 | elapsed:    9.2s remaining:    0.0s


[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1), score=0.159, total=   9.2s
[CV] tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=None)]: Done   2 out of   2 | elapsed:   18.3s remaining:    0.0s


[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1), score=0.166, total=   9.2s
[CV] tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1) 
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1), score=0.175, total=   9.4s
[CV] tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1) 
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 1), score=0.164, total=   9.2s
[CV] tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 2) 
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 2), score=0.182, total=  11.4s
[CV] tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 2) 
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, vect__ngram_range=(1, 2), score=0.167, total=  11.5s
[CV] 

[Parallel(n_jobs=None)]: Done 180 out of 180 | elapsed: 30.0min finished


{'tfidf__use_idf': True, 'vect__max_df': 0.5, 'vect__max_features': 5000, 'vect__ngram_range': (1, 2)}
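
The log line "Fitting 5 folds for each of 36 candidates, totalling 180 fits" follows from the size of the parameter grid: 2 × 3 × 3 × 2 = 36 combinations, each fitted on 5 folds. A small check of that arithmetic using only the standard library (the fold count is taken from the log, since the code does not set `cv` explicitly):

```python
from itertools import product

parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
}

# every combination of one value per parameter is one grid search candidate
candidates = list(product(*parameters.values()))
n_folds = 5  # as reported in the log above
print(len(candidates), len(candidates) * n_folds)  # 36 180
```

At roughly 10 seconds per fit, 180 fits account for the 30 minutes reported at the end of the log.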


In [36]:
#tuning model per best_params
def nbpipe_tuned(X_train, X_test, y_train, y_test):
    '''function to:
    - pipeline using countvectorizer and tfidf transformers; 
    - pipeline classify with multinomial naive bayes, best params after gridsearch analysis
    - fit pipeline with training data
    - predict on test data
    '''
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize, max_df = 0.5, max_features = 5000, ngram_range=(1, 2))), 
        ('tfidf', TfidfTransformer(use_idf = True)), 
        ('clf', MultiOutputClassifier(MultinomialNB()))
    ])
    
    
    # fit/train transformers and classifier on tuned model
    pipeline.fit(X_train, y_train)
    # predict on test data
    y_pred = pipeline.predict(X_test)
    
    return pipeline, y_pred

#train pipeline
pipeline, y_pred = nbpipe_tuned(X_train, X_test, y_train, y_test)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [37]:
#Multinomial NB score with tuning.
#F1 score and accuracy score are better with the tuned model (below)
target_names = ycol
print(classification_report(y_test, y_pred, target_names=target_names, zero_division=0))
#Precision: what percent of the positive predictions were correct? Precision = TP/(TP + FP)
#Recall: what percent of the actual positive cases were caught? Recall = TP/(TP + FN)
#F1 score: harmonic mean of precision and recall. F1 Score = 2*(Recall * Precision) / (Recall + Precision)
#Support is the number of actual occurrences of the class in the specified dataset

                        precision    recall  f1-score   support

               related       0.84      0.93      0.88      6617
               request       0.71      0.66      0.68      1450
                 offer       0.00      0.00      0.00        41
           aid_related       0.75      0.65      0.69      3607
          medical_help       0.61      0.13      0.21       694
      medical_products       0.65      0.17      0.27       441
     search_and_rescue       0.00      0.00      0.00       242
              security       0.00      0.00      0.00       170
              military       0.63      0.05      0.08       265
           child_alone       0.00      0.00      0.00         0
                 water       0.78      0.27      0.40       542
                  food       0.77      0.45      0.57       950
               shelter       0.82      0.25      0.38       785
              clothing       0.58      0.12      0.20       123
                 money       1.00      
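
As a worked check of the F1 formula in the comments above, the F1 column can be recomputed from the precision and recall values in a few of the printed rows. This is only a sketch; `f1_from` is a hypothetical helper, and small differences from the report are possible in general because the inputs are already rounded to two decimals:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the tuned-model report above
rows = {'request': (0.71, 0.66), 'water': (0.78, 0.27), 'food': (0.77, 0.45)}
for name, (p, r) in rows.items():
    print(name, round(f1_from(p, r), 2))
# request 0.68
# water 0.4
# food 0.57
```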

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [38]:
#PIPELINE, KNN ONLY
def knn_pipe(X_train, X_test, y_train, y_test):
    '''function to:
    - pipeline using countvectorizer and tfidf transformers; classify with K nearest neighbor
    - fit pipeline with training data
    - predict on test data
    '''
    pipeline2 = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize, max_df = 0.5, max_features = 5000, ngram_range=(1, 2))), 
        ('tfidf', TfidfTransformer(use_idf = True)), 
        ('clf', MultiOutputClassifier(KNeighborsClassifier(n_neighbors = 5)))
    ])
    
    
    # fit/train transformers and classifier
    pipeline2.fit(X_train, y_train)
    # predict on test data
    y_pred2 = pipeline2.predict(X_test)
    
    return pipeline2, y_pred2

#train pipeline
pipeline2, y_pred2 = knn_pipe(X_train, X_test, y_train, y_test)

In [39]:
target_names = ycol
print(classification_report(y_test, y_pred2, target_names=target_names, zero_division=0))
#Precision: what percent of the positive predictions were correct? Precision = TP/(TP + FP)
#Recall: what percent of the actual positive cases were caught? Recall = TP/(TP + FN)
#F1 score: harmonic mean of precision and recall. F1 Score = 2*(Recall * Precision) / (Recall + Precision)
#Support is the number of actual occurrences of the class in the specified dataset

                        precision    recall  f1-score   support

               related       0.78      0.99      0.87      6617
               request       0.84      0.06      0.12      1450
                 offer       0.00      0.00      0.00        41
           aid_related       0.72      0.03      0.06      3607
          medical_help       0.83      0.01      0.01       694
      medical_products       0.57      0.02      0.04       441
     search_and_rescue       0.00      0.00      0.00       242
              security       0.00      0.00      0.00       170
              military       0.00      0.00      0.00       265
           child_alone       0.00      0.00      0.00         0
                 water       0.88      0.04      0.07       542
                  food       0.84      0.06      0.11       950
               shelter       0.78      0.02      0.04       785
              clothing       0.40      0.02      0.03       123
                 money       0.25      

### Multinomial Naive Bayes with tuned parameters performs better than the K Nearest Neighbor model
Micro-average F1 and recall scores are much better with the tuned MNB. Precision is similar across all models.

I will stick with the fine-tuned MNB as the model of choice.

### 9. Export your model as a pickle file
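
No export cell is included in this notebook. A minimal sketch of one way to do it with `pickle`, shown on a toy pipeline and a temporary file so that it runs on its own; the file name `classifier.pkl` and the toy data are assumptions, and in the notebook the object to dump would be the tuned `pipeline`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# toy stand-in for the messages and the multi-label targets
X_toy = ['need water', 'need food', 'water and food', 'storm update']
y_toy = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(MultinomialNB())),
])
model.fit(X_toy, y_toy)

# hypothetical output path inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# reload and confirm the restored model predicts the same labels
with open(path, 'rb') as f:
    restored = pickle.load(f)
same = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same)
```

Because the notebook's pipeline references the custom `tokenize` function, pickle stores only a reference to it, so whatever script loads the model needs `tokenize` to be importable under the same module path.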

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
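
The template file itself is not shown here, so the following is only a guess at the shape such a script could take: it reads a database path and an output model path from the command line. The argument names and the `parse_args` and `main` helpers are hypothetical, and the notebook steps are indicated as comments rather than implemented:

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths a training script would plausibly need."""
    parser = argparse.ArgumentParser(
        description='Train and export the message classifier.')
    parser.add_argument('database_filepath',
                        help='SQLite database produced by the ETL step')
    parser.add_argument('model_filepath',
                        help='where to write the pickled model')
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # 1. load X, y, ycol from args.database_filepath (see ETL_Data)
    # 2. split the data, fit the tuned pipeline, print classification_report
    # 3. pickle the fitted pipeline to args.model_filepath
    return args


# example call with explicit arguments instead of sys.argv
args = main(['DRP.db', 'classifier.pkl'])
print(args.database_filepath, args.model_filepath)
```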