# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [18]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier

from sklearn.metrics import f1_score, recall_score, precision_score, classification_report, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import GridSearchCV

from sklearn.tree import DecisionTreeClassifier


import pickle
import sys



import re
nltk.download(['stopwords','punkt','wordnet','averaged_perceptron_tagger'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('messages_labeled',engine)
df = df[df.related != 2]

X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis = 1)

cat_names = Y.columns.tolist()

In [3]:
df.head(1)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
X.shape, Y.shape

((25992,), (25992, 36))

In [5]:
X[:5]

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

### 2. Write a tokenization function to process your text data

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [7]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [8]:
X_train, X_test, y_train, y_test  = train_test_split(X,Y,test_size = 0.33, random_state = 42)

In [9]:
X_train.shape, y_train.shape

((17414,), (17414, 36))

In [10]:
X_train[:2]

25889    In southwest Sichuan province, seven people, i...
18484    The Somalia's Transitional Government (TNG) ca...
Name: message, dtype: object

In [11]:
y_train[:2]

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
25889,1,0,0,1,1,0,0,0,0,0,...,0,0,1,1,1,0,0,0,1,0
18484,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
pipeline.fit(X_train,y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [13]:
y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [14]:
def display_results(y_true,y_pred):
    '''
    ARGS:
    > Y_true (series), dependent variable from testing set
    > Y_pred (series), predicted value 
    
    OUTPUT: 
    Classification report based on recall, precision, and F1 score
    '''
    
    
    

In [15]:
for i, col in enumerate(y_test):
    print(i, col)

0 related
1 request
2 offer
3 aid_related
4 medical_help
5 medical_products
6 search_and_rescue
7 security
8 military
9 child_alone
10 water
11 food
12 shelter
13 clothing
14 money
15 missing_people
16 refugees
17 death
18 other_aid
19 infrastructure_related
20 transport
21 buildings
22 electricity
23 tools
24 hospitals
25 shops
26 aid_centers
27 other_infrastructure
28 weather_related
29 floods
30 storm
31 fire
32 earthquake
33 cold
34 other_weather
35 direct_report


In [16]:
confusion_df = pd.DataFrame(confusion_matrix(y_test.iloc[:,1],y_pred[:,1]),
             columns=["Predicted Class " + str(class_name) for class_name in [0,1]],
             index = ["Class " + str(class_name) for class_name in [0,1]])

print(confusion_df)

         Predicted Class 0  Predicted Class 1
Class 0               6937                157
Class 1                871                613


In [17]:
163/1484

0.10983827493261455

In [18]:
clas_rep = classification_report(y_test.iloc[:,1],y_pred[:,1])
#print(clas_rep)
precision,recall,fscore,support=score(y_test.iloc[:,1],y_pred[:,1])


In [19]:
print(clas_rep)

             precision    recall  f1-score   support

          0       0.89      0.98      0.93      7094
          1       0.80      0.41      0.54      1484

avg / total       0.87      0.88      0.86      8578



In [20]:
print(precision, recall,fscore,support)

[ 0.88844775  0.7961039 ] [ 0.97786862  0.41307278] [ 0.93101597  0.54392192] [7094 1484]


In [21]:
for i, col in enumerate(y_test):
    
    try: 
        y_true = y_test[col]
        y_pred2 = y_pred[:,i]
        clas_report = classification_report(y_true, y_pred2)
        precision,recall,fscore,support=score(y_true, y_pred2)

        print(i,col)
        print(clas_report)
        print(f'Precision: from the {y_pred2.sum()} tweets labeled as {col}, {round(precision[1]*100,1)}% were actualy {col}')
        print(f'Recall: From the {support[1]} tweets that were actually {col}, {round(recall[1]*100,1)}% were labeled as {col} \n' )
        print('-------------------------------------------------------')
        
    except: pass
    

0 related
             precision    recall  f1-score   support

          0       0.62      0.47      0.53      2036
          1       0.85      0.91      0.88      6542

avg / total       0.79      0.81      0.80      8578

Precision: from the 7033 tweets labeled as related, 84.6% were actualy related
Recall: From the 6542 tweets that were actually related, 91.0% were labeled as related 

-------------------------------------------------------
1 request
             precision    recall  f1-score   support

          0       0.89      0.98      0.93      7094
          1       0.80      0.41      0.54      1484

avg / total       0.87      0.88      0.86      8578

Precision: from the 770 tweets labeled as request, 79.6% were actualy request
Recall: From the 1484 tweets that were actually request, 41.3% were labeled as request 

-------------------------------------------------------
2 offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.0

  'precision', 'predicted', average, warn_for)


IndexError: index 1 is out of bounds for axis 0 with size 1

### 6. Improve your model
Use grid search to find better parameters. 

In [22]:
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f4d4c411048>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None,

In [27]:
parameters = {
    'clf__estimator__n_estimators' : [200,400,1000,2000],
    'clf__estimator__max_depth' : [10,20,50,100,None], 
    'clf__estimator__min_samples_split' : [2,5,10],
    'clf__estimator__min_samples_leaf' : [1,2,4]
}

cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

In [28]:
cv.best_params_

{'clf__estimator__min_samples_leaf': 1, 'clf__estimator__min_samples_split': 2}

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [None]:
def build_model2():
    '''
    Returns GrindeSearchCV object as model, i.e. classifier with optimized parameters
    
    args: 
        None
        
    Returns: 
        cv : GridSearch Model Object

    '''
    pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
    ])
    
    parameters = {
    #'clf__estimator__n_estimators' : [10,100],
    #'clf__estimator__max_depth' : [10,100,None], 
    #'clf__estimator__min_samples_split' : [2,10],
    'clf__estimator__min_samples_leaf' : [1,2]
    }

    cv = GridSearchCV(pipeline, param_grid=parameters,n_jobs=-1)
    return cv


### 9. Export your model as a pickle file

In [38]:
def save_model(model, model_filepath):
    """saves the model to the given filepath
    Args:
        model (scikit-learn): fitted model
        model_filepath (string): filepath
    Returns:
        None
    """
    pickle.dump(model.best_estimator_, open(model_filepath, 'wb'))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.

In [2]:
def load_data(db_filepath='InsertDatabaseName.db'):
    '''
    Reads database sqlite data base as a data Frame
    Separates into independent variable X and dependent variable Y
    
    Args:
        db_filepath(str): filepath to the database
    
    Returns: 
        X (pandas DataFrame): Independent variable
        Y (pandas DataFrame): Dependent variable 
        cat_names (list): Y category names
    '''
    
    engine = create_engine('sqlite:///' + db_filepath)
    df = pd.read_sql_table('messages_labeled',engine)
    df = df[df.related != 2]

    X = df['message']
    Y = df.drop(['id', 'message', 'original', 'genre'], axis = 1)
    
    cat_names = Y.columns.tolist()

    return X, Y

In [3]:
def tokenize(text):
    '''
    INPUT: Text (string)
    PROCESS: 
    > Lowercase
    > word_tokenize 
    > remove stopwords
    
    OUTPUT: Normalized list of words
    '''
    text = re.sub("[^a-zA-Z0-9]"," ",text) # remove special characters
    text = text.lower() #lowercase entire text
    words = word_tokenize(text) #Split into words
    
    stop_words = stopwords.words('english') # load stop words
    words = [word for word in words if word not in stop_words] #only those words not in stop_words
    
    #Lemmatization & stemmization
    
    lemmed = [WordNetLemmatizer().lemmatize(w) for w in words]
    stemmed = [PorterStemmer().stem(w) for w in lemmed]
    
    return stemmed
    
    

In [12]:
def build_model():
    '''
    Returns GrindeSearchCV object as model, i.e. classifier with optimized parameters
    
    args: 
        None
        
    Returns: 
        cv : GridSearch Model Object

    '''
    pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    
    parameters = {
    #'clf__estimator__n_estimators' : [10,100],
    #'clf__estimator__max_depth' : [10,100,None], 
    #'clf__estimator__min_samples_split' : [2,10],
    'clf__estimator__min_samples_leaf' : [1,2]
    }

    cv = GridSearchCV(pipeline, param_grid=parameters,n_jobs=-1)
    return cv

    

In [4]:
def evaluate_model(model, X_test, y_test):
    '''
    Args: 
        model (classifier)
        X_test (pandas DataFrame) : independent variables for testings=
        y_test (pandas DataFrame) : Dependent variables with 'true' values
        
    Output
        printed scores
    '''
    
    y_pred = model.predict(X_test)
    
    for i, col in enumerate(y_test):
    
        try: 
            y_true = y_test[col]
            y_pred2 = y_pred[:,i]
            clas_report = classification_report(y_true, y_pred2)
            precision,recall,fscore,support=score(y_true, y_pred2)

            print(i,col)
            print(clas_report)
            print(f'Precision: from the {y_pred2.sum()} tweets labeled as {col}, {round(precision[1]*100,1)}% were actualy {col}')
            print(f'Recall: From the {support[1]} tweets that were actually {col}, {round(recall[1]*100,1)}% were labeled as {col} \n' )
            print('-------------------------------------------------------')

        except: pass

    

In [8]:
X, Y = load_data()

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.33)

In [13]:
model = build_model()

In [14]:
model.fit(X_train,Y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'clf__estimator__min_samples_leaf': [1, 2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [15]:
evaluate_model(model, X_test, Y_test)

0 related
             precision    recall  f1-score   support

          0       0.66      0.46      0.55      2027
          1       0.85      0.93      0.89      6551

avg / total       0.80      0.82      0.81      8578

Precision: from the 7159 tweets labeled as related, 84.8% were actualy related
Recall: From the 6551 tweets that were actually related, 92.7% were labeled as related 

-------------------------------------------------------
1 request
             precision    recall  f1-score   support

          0       0.89      0.97      0.93      7112
          1       0.79      0.44      0.57      1466

avg / total       0.88      0.88      0.87      8578

Precision: from the 828 tweets labeled as request, 78.5% were actualy request
Recall: From the 1466 tweets that were actually request, 44.3% were labeled as request 

-------------------------------------------------------
2 offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.0

  'precision', 'predicted', average, warn_for)


In [20]:
sys.argv[1:]

['-f',
 '/root/.local/share/jupyter/runtime/kernel-d65b6651-bea9-48e3-a9b6-d44c75684be7.json']

In [21]:
save_model(model,'/root/.local/share/jupyter/runtime/kernel-d65b6651-bea9-48e3-a9b6-d44c75684be7.json')

In [36]:
def main():
    if len(sys.argv) == 3:
        model_filepath = sys.argv[:-1]
        db_filepath = 'InsertDatabaseName.db'
        print('Loading data...\n    database location: {}'.format(db_filepath))
        X, Y = load_data(db_filepath)
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33)

        print('Building model...')
        model = build_model()

        print('Training model...')
        model.fit(X_train, Y_train)

        print('Evaluating model...')
        evaluate_model(model, X_test, Y_test)

        print('Saving model...   model: {}'.format(model_filepath))
        save_model(model, model_filepath)

        print('You just saved the trained model')

    else:
        print('Error, file locations not correct')

In [30]:
db_filepath, model_filepath = sys.argv[1:]

In [31]:
db_filepath

'-f'

In [37]:
main()

Loading data...
    database location: InsertDatabaseName.db
Building model...
Training model...
Evaluating model...
0 related
             precision    recall  f1-score   support

          0       0.62      0.46      0.53      1978
          1       0.85      0.92      0.88      6600

avg / total       0.80      0.81      0.80      8578

Precision: from the 7122 tweets labeled as related, 85.0% were actualy related
Recall: From the 6600 tweets that were actually related, 91.7% were labeled as related 

-------------------------------------------------------
1 request
             precision    recall  f1-score   support

          0       0.90      0.98      0.93      7100
          1       0.79      0.46      0.58      1478

avg / total       0.88      0.89      0.87      8578

Precision: from the 861 tweets labeled as request, 79.4% were actualy request
Recall: From the 1478 tweets that were actually request, 46.3% were labeled as request 

------------------------------------------

  'precision', 'predicted', average, warn_for)


TypeError: expected str, bytes or os.PathLike object, not list