# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [87]:
# import libraries
import time
import re
import nltk
import pickle
import pandas as pd

from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Suppress warnings
import warnings; warnings.simplefilter('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ursula\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ursula\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ursula\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ursula\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [64]:
# load data from database
engine = create_engine('sqlite:///Messages.db')
df = pd.read_sql_table('Messages', engine) 

In [65]:
# Print first few lines of dataframe
df.head(2)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [66]:
# Define X and Y variables
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)

### 2. Write a tokenization function to process your text data

In [67]:
# Define tokenize function
# Excluded the lemmatization step to increase speed
# Defining the variable "stop_words" makes the code run significantly faster

def tokenize(text):
    
    # Case normalization
    text = text.lower()
    
    # Punctuation removal
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    
    # Tokenize words
    words = word_tokenize(text)
    
    # Stop word removal
    stop_words = stopwords.words("english")
    words = [w for w in words if w not in stop_words]
    
    # Perform stemming (stop words were already removed above)
    stemmer = PorterStemmer()
    words = [stemmer.stem(w) for w in words]
    
    return words

In [68]:
%%timeit

# Time how long the tokenize function takes on this short sentence
# to help with speed optimization

text = "You are the greatest, no 1.0, person in the world."
tokenize(text)

962 µs ± 8.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
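The comment in the tokenize cell notes that how `stop_words` is defined affects speed. One common way to cut that cost further is to build the stop-word collection once, outside the function, as a set, so that each membership test is a hash lookup instead of a list scan. The sketch below illustrates the idea under simplifying assumptions: `STOP_WORDS` is a made-up, abbreviated list (the notebook uses `stopwords.words("english")`), `tokenize_fast` is a hypothetical name, and splitting on whitespace stands in for `word_tokenize`; stemming is left out.

```python
import re

# Hypothetical, abbreviated stop-word list built once at import time.
# The notebook itself uses nltk.corpus.stopwords.words("english").
STOP_WORDS = frozenset({"you", "are", "the", "in", "no", "a", "an", "is", "of"})


def tokenize_fast(text):
    """Lowercase, replace punctuation with spaces, split, drop stop words."""
    text = re.sub(r"[^a-z0-9]", " ", text.lower())
    # Set membership is a constant-time lookup on average.
    return [w for w in text.split() if w not in STOP_WORDS]


tokens = tokenize_fast("You are the greatest, no 1.0, person in the world.")
print(tokens)
```

Whether this helps noticeably in the real pipeline would need to be checked with `%%timeit`, since stemming and `word_tokenize` also contribute to the runtime.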


### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [69]:
# Build ML pipeline using random forest classifier
pipeline1 = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])
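To see what this kind of pipeline returns, here is a minimal sketch on made-up data. The messages, the two label columns (think of them as `water` and `storm`), and the name `toy_pipeline` are all invented for illustration, and the default tokenizer is used instead of the notebook's `tokenize`. The point is only that `MultiOutputClassifier` fits one classifier per label column and that `predict` returns one column per label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up toy messages with two made-up binary labels per message.
messages = [
    "we need water", "no water here", "storm is coming",
    "big storm tonight", "water and storm", "hello there",
]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # random_state is fixed only to make this sketch repeatable.
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(messages, labels)

pred = toy_pipeline.predict(["send water", "storm warning"])
print(pred.shape)  # one row per message, one column per label
```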

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [70]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y)

start = time.time()

# Train model
pipeline1.fit(X_train, y_train)

end = time.time()
train_time1 = end - start
print(train_time1)

96.1573212146759


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [71]:
# Predict labels using model
start = time.time()

y_pred1 = pipeline1.predict(X_test)

end = time.time()
pred_time1 = end - start
print(pred_time1)

14.200883150100708
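The `start`/`end` timing pattern is repeated for every model in this notebook. As an optional tidy-up, it could be wrapped in a small context manager. The helper name `timed` and the `timings` dictionary below are hypothetical, and `time.sleep` stands in for a call such as `pipeline1.predict(X_test)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("predict", timings):
    time.sleep(0.01)  # stand-in for the real work being timed

print(timings)
```

Collecting the durations in one dictionary also makes it easier to compare models side by side at the end.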


In [72]:
# Print classification report (precision, recall, f1-score per category)
report1 = pd.DataFrame.from_dict(classification_report(y_test, y_pred1, target_names=Y.columns, output_dict=True))
report1 = pd.DataFrame.transpose(report1)
report1

category,f1-score,precision,recall,support
related,0.880092,0.843456,0.920056,4966.0
request,0.565192,0.779715,0.443243,1110.0
offer,0.0,0.0,0.0,30.0
aid_related,0.680937,0.751672,0.62237,2709.0
medical_help,0.173045,0.58427,0.101562,512.0
medical_products,0.16,0.777778,0.089172,314.0
search_and_rescue,0.022599,0.333333,0.011696,171.0
security,0.017241,0.25,0.008929,112.0
military,0.141176,0.580645,0.080357,224.0
water,0.487562,0.830508,0.34507,426.0
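To make the columns of the report concrete, the sketch below computes precision, recall and f1 by hand for a single binary category. The label vectors are made up and `precision_recall_f1` is a hypothetical helper, not part of scikit-learn. It also shows why a category for which no positive example is found ends up with all three scores at 0; returning 0 when a denominator is zero is a choice made here for the illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and f1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Fall back to 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for one category: 4 positives, 4 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)

# A model that never predicts the positive class scores 0 everywhere.
never_positive = precision_recall_f1(y_true, [0] * len(y_true))
print(never_positive)
```

High precision paired with low recall, as in the `medical_products` row, means the model rarely flags the category but is usually right when it does.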


### 6. Improve your model
Use grid search to find better parameters. 

In [73]:
# Use grid search to find better parameters

start = time.time()

parameters = {
    'clf__estimator__n_estimators': [10, 100],
    'clf__estimator__min_samples_split': [2, 5]
}

cv = GridSearchCV(pipeline1, param_grid=parameters, verbose=1)
cv.fit(X_train, y_train)
y_pred_cv = cv.predict(X_test)

end = time.time()
cv_time = end - start
print(cv_time)

Fitting 3 folds for each of 4 candidates, totalling 12 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 47.8min finished


3522.90900349617
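The "4 candidates" and "12 fits" in the log follow directly from the parameter grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. The sketch below reproduces that count with the standard library; `n_folds` is taken from the log line above.

```python
from itertools import product

parameters = {
    "clf__estimator__n_estimators": [10, 100],
    "clf__estimator__min_samples_split": [2, 5],
}
n_folds = 3  # from the "Fitting 3 folds" line in the log

# Every combination of parameter values is one candidate.
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*parameters.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 4 12
```

With the default `refit=True`, `GridSearchCV` also refits the best candidate once on the whole training set, so adding a third value to either list would already push the run well past the 47.8 minutes seen here.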


In [74]:
# Show results of gridsearch.  
# According to the rank_test_score and train_scores:
# param_clf__estimator__min_samples_split = 2 and param_clf__estimator__n_estimators = 100
# is the best option (also the slowest).

pd.DataFrame.from_dict(cv.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__estimator__min_samples_split,param_clf__estimator__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,mean_train_score,std_train_score
0,59.498911,3.461359,15.834931,0.224433,2,10,"{'clf__estimator__min_samples_split': 2, 'clf_...",0.243276,0.245582,0.237744,0.242201,0.003289,3,0.813662,0.815122,0.806055,0.811613,0.003975
1,319.299773,1.323695,33.626077,0.358266,2,100,"{'clf__estimator__min_samples_split': 2, 'clf_...",0.269095,0.267097,0.260028,0.265406,0.00389,1,0.995082,0.995774,0.994467,0.995108,0.000534
2,44.249653,0.262935,14.65561,2.357581,5,10,"{'clf__estimator__min_samples_split': 5, 'clf_...",0.240664,0.236361,0.227908,0.234978,0.005298,4,0.752113,0.751883,0.753343,0.752446,0.000641
3,246.829281,1.474955,33.814294,0.355266,5,100,"{'clf__estimator__min_samples_split': 5, 'clf_...",0.258337,0.260335,0.2568,0.258491,0.001447,2,0.89834,0.904641,0.898187,0.900389,0.003007


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [76]:
# Print classification report (precision, recall, f1-score per category)
report_cv = pd.DataFrame.from_dict(classification_report(y_test, y_pred_cv, target_names=Y.columns, output_dict=True))
report_cv = pd.DataFrame.transpose(report_cv)
report_cv

category,f1-score,precision,recall,support
related,0.891432,0.840407,0.949054,4966.0
request,0.634637,0.835294,0.511712,1110.0
offer,0.0,0.0,0.0,30.0
aid_related,0.736581,0.763262,0.711702,2709.0
medical_help,0.131439,0.72549,0.072266,512.0
medical_products,0.140762,0.888889,0.076433,314.0
search_and_rescue,0.088398,0.8,0.046784,171.0
security,0.0,0.0,0.0,112.0
military,0.10084,0.857143,0.053571,224.0
water,0.565625,0.845794,0.424883,426.0


In [85]:
# Compare f1 scores between default model and improved model
# (default minus tuned: negative values mean the tuned model scored higher)
report1['f1-score'] - report_cv['f1-score']

related                  -0.011339
request                  -0.069444
offer                     0.000000
aid_related              -0.055644
medical_help              0.041606
medical_products          0.019238
search_and_rescue        -0.065799
security                  0.017241
military                  0.040336
water                    -0.078063
food                     -0.002230
shelter                  -0.034191
clothing                  0.083273
money                     0.010660
missing_people            0.000000
refugees                 -0.010086
death                     0.043897
other_aid                 0.037482
infrastructure_related    0.004808
transport                -0.013830
buildings                -0.081194
electricity              -0.038375
tools                     0.000000
hospitals                 0.000000
shops                     0.000000
aid_centers               0.000000
other_infrastructure      0.000000
weather_related          -0.080106
floods              

The default RF and the improved model appear to be similar in terms of f1 score: in the categories shown, no difference exceeds about 0.08 in either direction.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [77]:
# Create new pipeline using MultinomialNB classifier
pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(MultinomialNB()))
])

In [78]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y)

start = time.time()

# Train model
pipeline2.fit(X_train, y_train)

end = time.time()
train_time2 = end - start
print(train_time2)

41.964864015579224


In [79]:
# Test model
start = time.time()

y_pred2 = pipeline2.predict(X_test)

end = time.time()
pred_time2 = end - start
print(pred_time2)

13.308486700057983


In [80]:
# Print classification report (precision, recall, f1-score per category)
report2 = pd.DataFrame.from_dict(classification_report(y_test, y_pred2, target_names=Y.columns, output_dict=True))
report2 = pd.DataFrame.transpose(report2)
report2

category,f1-score,precision,recall,support
related,0.879958,0.790772,0.991818,5011.0
request,0.360882,0.85342,0.228821,1145.0
offer,0.0,0.0,0.0,28.0
aid_related,0.689252,0.761926,0.629234,2716.0
medical_help,0.007921,0.666667,0.003984,502.0
medical_products,0.0,0.0,0.0,301.0
search_and_rescue,0.0,0.0,0.0,176.0
security,0.0,0.0,0.0,103.0
military,0.0,0.0,0.0,233.0
water,0.0,0.0,0.0,427.0


The MultinomialNB classifier performs poorly compared to the RandomForestClassifier: its f1 score is 0 for many categories, meaning it finds none of their positive examples.
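One caveat when comparing these reports: `train_test_split` is called again before each new model without a fixed `random_state`, so the models are scored on different test sets (the `related` support is 5011 in the table above versus 4966 for the random forest). A possible remedy, sketched here on made-up data, is to pass the same `random_state` every time, or to split once and reuse the result. The value 42 is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the message and label columns.
X_toy = [f"message {i}" for i in range(20)]
y_toy = [i % 2 for i in range(20)]

# The same random_state gives the same partition on every call.
split_a = train_test_split(X_toy, y_toy, random_state=42)
split_b = train_test_split(X_toy, y_toy, random_state=42)

same_test_set = split_a[1] == split_b[1]
print(same_test_set)
```

With identical test sets, differences between the reports can be attributed to the models rather than to the split.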

In [81]:
# Create new pipeline using SVC classifier
pipeline3 = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(SVC()))
])

In [82]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y)

start = time.time()

# Train model
pipeline3.fit(X_train, y_train)

end = time.time()
train_time3 = end - start
print(train_time3)

652.3256278038025


In [83]:
# Test model
start = time.time()

y_pred3 = pipeline3.predict(X_test)

end = time.time()
pred_time3 = end - start
print(pred_time3)

178.31374645233154


In [84]:
# Print classification report (precision, recall, f1-score per category)
report3 = pd.DataFrame.from_dict(classification_report(y_test, y_pred3, target_names=Y.columns, output_dict=True))
report3 = pd.DataFrame.transpose(report3)
report3

category,f1-score,precision,recall,support
related,0.865685,0.763178,1.0,4966.0
request,0.0,0.0,0.0,1137.0
offer,0.0,0.0,0.0,24.0
aid_related,0.0,0.0,0.0,2746.0
medical_help,0.0,0.0,0.0,517.0
medical_products,0.0,0.0,0.0,337.0
search_and_rescue,0.0,0.0,0.0,194.0
security,0.0,0.0,0.0,129.0
military,0.0,0.0,0.0,201.0
water,0.0,0.0,0.0,407.0


The SVC classifier performs poorly compared to the RandomForestClassifier: its recall is 1.0 for 'related', but it finds no positive examples in any of the other categories shown.

### 9. Export your model as a pickle file

In [89]:
with open('disaster_model.sav', 'wb') as f:
    pickle.dump(cv, f)
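A quick way to check an export like this is to load the file back and compare. The sketch below does a round trip with a plain dictionary standing in for the fitted model; the dictionary contents and the temporary directory are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted `cv`.
model = {"name": "stand-in model", "params": {"n_estimators": 100}}
path = os.path.join(tempfile.mkdtemp(), "disaster_model.sav")

# Context managers make sure the file is flushed and closed.
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Keep in mind that pickle stores functions by reference rather than by value, so loading the real model from another script would most likely require a `tokenize` function to be available under the same module path it had when the model was saved.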

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.