# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [None]:
# import libraries
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download nltk data
nltk.download(['stopwords', 'wordnet', 'punkt'])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
import os

In [None]:
# Change directory to the data directory, relative to the directory containing this .ipynb
os.chdir('data')

In [None]:
!ls

CleanedMessages.db
categories.csv
messages.csv
test_save.db


In [None]:
# load data from database
engine = create_engine('sqlite:///CleanedMessages.db')

with engine.connect() as conn:
    df = pd.read_sql('CategorizedMessages', conn)

In [None]:
# Split into model input and categories
X = df[['message']]
Y = df.loc[:, 'related':]

In [None]:
X.head(3)

|   | message |
|---|---|
| 0 | Weather update - a cold front from Cuba that c... |
| 1 | Is the Hurricane over or is it not over |
| 2 | Looking for someone but no name |


In [None]:
print(Y.shape)
Y.head(3)

(26216, 36)


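|   | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |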


### Data Exploration

In [None]:
df.head(3)

|   | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [None]:
# Check each column for null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26216 entries, 0 to 26215
Data columns (total 40 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      26216 non-null  int64 
 1   message                 26216 non-null  object
 2   original                10170 non-null  object
 3   genre                   26216 non-null  object
 4   related                 26216 non-null  int64 
 5   request                 26216 non-null  int64 
 6   offer                   26216 non-null  int64 
 7   aid_related             26216 non-null  int64 
 8   medical_help            26216 non-null  int64 
 9   medical_products        26216 non-null  int64 
 10  search_and_rescue       26216 non-null  int64 
 11  security                26216 non-null  int64 
 12  military                26216 non-null  int64 
 13  child_alone             26216 non-null  int64 
 14  water                   26216 non-null  int64 
 15  fo

In [None]:
df.genre.value_counts()

news      13054
direct    10766
social     2396
Name: genre, dtype: int64

Check values of categorization columns.

In [None]:
for col in df.columns[4:]:
    print(df[col].value_counts())
    print()

1    19906
0     6122
2      188
Name: related, dtype: int64

0    21742
1     4474
Name: request, dtype: int64

0    26098
1      118
Name: offer, dtype: int64

0    15356
1    10860
Name: aid_related, dtype: int64

0    24132
1     2084
Name: medical_help, dtype: int64

0    24903
1     1313
Name: medical_products, dtype: int64

0    25492
1      724
Name: search_and_rescue, dtype: int64

0    25745
1      471
Name: security, dtype: int64

0    25356
1      860
Name: military, dtype: int64

0    26216
Name: child_alone, dtype: int64

0    24544
1     1672
Name: water, dtype: int64

0    23293
1     2923
Name: food, dtype: int64

0    23902
1     2314
Name: shelter, dtype: int64

0    25811
1      405
Name: clothing, dtype: int64

0    25612
1      604
Name: money, dtype: int64

0    25918
1      298
Name: missing_people, dtype: int64

0    25341
1      875
Name: refugees, dtype: int64

0    25022
1     1194
Name: death, dtype: int64

0    22770
1     3446
Name: other_aid, dtype: int6

We see that while most columns have values in {0, 1} indicating false/true, the 'related' column has values from the set: {0, 1, 2}. 

I didn't find documentation that explains this, so let's investigate further.

What is the character of the messages in each of the categories of the 'related' column?

In [None]:
for r_val in [1, 0, 2]:
    print(f"\n============= val: {r_val} =============")
    sub_df = df[df.related==r_val]
    for ind in range(80):
        print(sub_df.message.iloc[ind])


Weather update - a cold front from Cuba that could pass over Haiti
Is the Hurricane over or is it not over
Looking for someone but no name
UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
says: west side of Haiti, rest of the country today and tonight
Storm at sacred heart of jesus
Please, we need tents and water. We are in Silo, Thank you!
I am in Croix-des-Bouquets. We have health issues. They ( workers ) are in Santo 15. ( an area in Croix-des-Bouquets )
There's nothing to eat and water, we starving and thirsty.
I am in Thomassin number 32, in the area named Pyron. I would like to have some water. Thank God we are fine, but we desperately need water. Thanks
Let's do it together, need food in Delma 75, in didine area
More information on the 4636 number in order for me to participate. ( To see if I can use it )
A Comitee in Delmas 19, Rue ( street ) Janvier, Impasse Charite #2. We have about 500 people in a temporary shelter and we 

It seems that the 'related' value is 1 if the message is related to some disaster, and 0 otherwise. Messages with a 'related' value of 2 also include untranslated messages and miscellaneous garbage.

Let's look at the values of the other categorization columns for each of the 3 values for 'related'.

In [None]:
# mean number of other flags per row when 'related' col val = 1
df[df.related==1].loc[:, 'request':].sum(axis=1).mean()

3.16693459258515

In [None]:
# mean number of other flags per row when 'related' col val = 0
df[df.related==0].loc[:, 'request':].sum(axis=1).mean()

0.0

In [None]:
# when 'related' col val = 2, none of the other flags are turned on.
df[df.related==2].loc[:, 'request':].sum(axis=1).sum()

0

**Conclusion:** Other categorization flags are on (= 1) for a row only if that row has a value of 1 for 'related'. 
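The same conclusion can be checked in a single step by grouping the per-row flag counts by the value of 'related'. The snippet below is a minimal sketch on an invented toy frame (`toy` and its rows are assumptions for illustration, not the real table); on the notebook's data, `toy` would be replaced by `df` restricted to the categorization columns.

```python
import pandas as pd

# Toy stand-in for the notebook's df; the column names mirror the real table,
# but the rows are invented for illustration.
toy = pd.DataFrame({
    "related": [1, 1, 0, 2, 0],
    "request": [1, 0, 0, 0, 0],
    "water":   [1, 1, 0, 0, 0],
})

# Number of flags switched on in each row, excluding 'related' itself.
flag_counts = toy.drop(columns="related").sum(axis=1)

# Largest flag count observed for each value of 'related'.
max_flags_by_related = flag_counts.groupby(toy["related"]).max()
print(max_flags_by_related)
```

A maximum of 0 for the groups 0 and 2 means no other flag is ever set in those rows.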

### 2. Write a tokenization function to process your text data

In [None]:
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

In [None]:
def tokenize(text):
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens

In [None]:
for msg in df.iloc[:12,1]:
    print(msg, '\n    ', tokenize(msg), '\n')

Weather update - a cold front from Cuba that could pass over Haiti 
     ['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti'] 

Is the Hurricane over or is it not over 
     ['hurricane'] 

Looking for someone but no name 
     ['looking', 'someone', 'name'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately. 
     ['un', 'report', 'leogane', '80', '90', 'destroyed', 'hospital', 'st', 'croix', 'functioning', 'need', 'supply', 'desperately'] 

says: west side of Haiti, rest of the country today and tonight 
     ['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight'] 

Information about the National Palace- 
     ['information', 'national', 'palace'] 

Storm at sacred heart of jesus 
     ['storm', 'sacred', 'heart', 'jesus'] 

Please, we need tents and water. We are in Silo, Thank you! 
     ['please', 'need', 'tent', 'water', 'silo', 'thank'] 

I would like to receive the messages, thank you 
     ['

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

#### Handle 'related' column values

In [None]:
# recode 'related' == 2 as 0, so that 1 means related and 0 means not related
df_2 = df.copy()
df_2.related = df_2.related.replace(2, 0)

In [None]:
df_2.related.value_counts()

1    19906
0     6310
Name: related, dtype: int64

#### Pipeline

In [None]:
# pipeline = 
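One possible way to fill in the placeholder above is sketched here. It is an assumption, not the notebook's final choice: the step names (`vect`, `tfidf`, `clf`), the random forest estimator, and its settings are all illustrative, and the two toy messages exist only to show that the pipeline fits and predicts.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def build_pipeline(tokenizer=None):
    """Bag-of-words -> TF-IDF -> one classifier per output column.

    `tokenizer` may be a callable such as the notebook's `tokenize`;
    with None, CountVectorizer falls back to its default tokenization.
    """
    return Pipeline([
        ("vect", CountVectorizer(tokenizer=tokenizer)),
        ("tfidf", TfidfTransformer()),
        ("clf", MultiOutputClassifier(
            RandomForestClassifier(n_estimators=10, random_state=42))),
    ])


pipeline = build_pipeline()

# Tiny invented example: two messages, two target columns.
pipeline.fit(["we need water", "storm is coming"], [[1, 0], [0, 1]])
pred = pipeline.predict(["need water"])
print(pred.shape)
```

Note that `CountVectorizer` expects an iterable of strings, so the real data would be passed as a 1-D Series such as `df_2['message']` rather than as a one-column DataFrame, since iterating a DataFrame yields its column names.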

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
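
The two steps above can be sketched as follows. The messages, the label columns, the 25% test size, and the random forest settings are invented for illustration; on the real data, the message Series and the 36 category columns would take their place.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented toy data standing in for the message column and category columns.
X_toy = pd.Series([
    "we need water", "need food and water", "storm is coming tonight",
    "heavy storm and rain", "water please", "food needed in the camp",
    "storm destroyed the road", "hello how are you",
])
Y_toy = pd.DataFrame({
    "aid":   [1, 1, 0, 0, 1, 1, 0, 0],
    "storm": [0, 0, 1, 1, 0, 0, 1, 0],
})

# Hold out a quarter of the rows for testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_toy, Y_toy, test_size=0.25, random_state=42)

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=42))),
])

# Fit on the training rows only, then predict one row of flags per test message.
pipeline.fit(X_train, Y_train)
Y_pred = pipeline.predict(X_test)
print(Y_pred.shape)
```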

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
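
A minimal sketch of that per-column loop is shown below. The arrays `Y_true` and `Y_pred` and the three category names are invented stand-ins for the test labels and the pipeline's predictions; `zero_division=0` is an optional choice that silences warnings for categories with no predicted positives.

```python
import numpy as np
from sklearn.metrics import classification_report

# Invented stand-ins for the test labels and the pipeline's predictions.
category_names = ["related", "request", "water"]
Y_true = np.array([[1, 0, 0], [1, 1, 1], [0, 0, 0], [1, 0, 1]])
Y_pred = np.array([[1, 0, 0], [1, 0, 1], [0, 0, 0], [1, 1, 0]])

# One report per output column: precision, recall, and F1 for each class.
reports = {}
for i, name in enumerate(category_names):
    reports[name] = classification_report(
        Y_true[:, i], Y_pred[:, i], zero_division=0)
    print(f"===== {name} =====")
    print(reports[name])
```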

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
# parameters = 

# cv = 
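
The placeholders above could be filled in along these lines. This is a sketch under assumptions: the parameter grid, the step names, the 2-fold setting, and the toy data are illustrative only. Parameters of the estimator wrapped by `MultiOutputClassifier` are addressed with the `clf__estimator__` prefix.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Invented toy data standing in for the messages and category columns.
messages = [
    "we need water", "need food and water", "storm is coming tonight",
    "heavy storm and rain", "water please", "food needed in the camp",
    "storm destroyed the road", "hello how are you",
]
Y_toy = pd.DataFrame({
    "aid":   [1, 1, 0, 0, 1, 1, 0, 0],
    "storm": [0, 0, 1, 1, 0, 0, 1, 0],
})

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=42))),
])

# A deliberately small grid; each key is <step name>__<parameter name>.
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__n_estimators": [5, 10],
}

# cv=2 keeps the toy example fast; a larger value suits the real data.
cv = GridSearchCV(pipeline, param_grid=parameters, cv=2)
cv.fit(messages, Y_toy)
print(cv.best_params_)
```

After fitting, `cv.best_estimator_` holds the pipeline refit with the best parameter combination and can be used for prediction directly.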

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!
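
One way to show these three metrics side by side is a small table with one row per category. The sketch below uses invented labels and predictions (`Y_true`, `Y_pred`, and the two category names are assumptions); with the tuned model, `Y_pred` would come from its `predict` call on the test messages.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented stand-ins for the test labels and the tuned model's predictions.
category_names = ["related", "request"]
Y_true = np.array([[1, 0], [1, 1], [0, 0], [1, 0]])
Y_pred = np.array([[1, 0], [0, 1], [0, 0], [1, 1]])

# Compute each metric column by column and collect the rows.
rows = []
for i, name in enumerate(category_names):
    rows.append({
        "category": name,
        "accuracy": accuracy_score(Y_true[:, i], Y_pred[:, i]),
        "precision": precision_score(Y_true[:, i], Y_pred[:, i], zero_division=0),
        "recall": recall_score(Y_true[:, i], Y_pred[:, i], zero_division=0),
    })

summary = pd.DataFrame(rows).set_index("category")
print(summary)
```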

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
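
As an illustration of the second idea, an extra feature can be combined with TF-IDF through `FeatureUnion`. The `MessageLength` transformer below is hypothetical and chosen only because it is simple; whether word count helps on this dataset is untested here.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of whitespace-separated words."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # One column, one row per message.
        return np.array([[len(text.split())] for text in X])


# TF-IDF columns and the word-count column are stacked side by side.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLength()),
])

matrix = features.fit_transform(["we need water", "storm"])
word_counts = matrix.toarray()[:, -1]
print(matrix.shape, word_counts)
```

A `FeatureUnion` like this can replace the vectorizer steps at the front of a pipeline, with the classifier step left unchanged.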

### 9. Export your model as a pickle file
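
A minimal sketch of the export step using the standard `pickle` module follows. The tiny single-label model, the file name `classifier.pkl`, and the temporary directory are assumptions for illustration; the trained multi-output pipeline would be dumped the same way.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny invented model standing in for the trained pipeline.
messages = ["we need water", "water please", "storm is coming", "heavy storm tonight"]
labels = [1, 1, 0, 0]
model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
model.fit(messages, labels)

# Hypothetical output location; any writable path works.
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

# Serialize the fitted pipeline to disk.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm that the predictions are unchanged.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same = (model.predict(messages) == restored.predict(messages)).all()
print(model_path, same)
```

A pickled model should be loaded with the same scikit-learn version that produced it, and only from a trusted source, since unpickling can execute arbitrary code.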

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.