# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [None]:
# import libraries
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# download nltk data
nltk.download(['stopwords', 'wordnet', 'punkt'])

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import os

In [None]:
# Change directory to the data directory, relative to the directory containing this .ipynb
os.chdir('data')

In [None]:
!ls

CleanedMessages.db
categories.csv
messages.csv
test_save.db


In [None]:
# load data from database
engine = create_engine('sqlite:///CleanedMessages.db')

with engine.connect() as conn:
    df = pd.read_sql('CategorizedMessages', conn)
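
If the table name is not known in advance, it can be read from the database itself. The sketch below uses the standard-library `sqlite3` module on an in-memory database. The `CategorizedMessages` table is created here only as a stand-in, and `list_tables` is a hypothetical helper rather than something the notebook defines.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an SQLite database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Stand-in database; the notebook would open 'CleanedMessages.db' instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CategorizedMessages (id INTEGER, message TEXT)")
tables = list_tables(conn)
print(tables)
conn.close()
```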

### Data Exploration

In [None]:
df.head(3)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Check for null values (only the 'original' column has any)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26216 entries, 0 to 26215
Data columns (total 40 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   id                      26216 non-null  int64 
 1   message                 26216 non-null  object
 2   original                10170 non-null  object
 3   genre                   26216 non-null  object
 4   related                 26216 non-null  int64 
 5   request                 26216 non-null  int64 
 6   offer                   26216 non-null  int64 
 7   aid_related             26216 non-null  int64 
 8   medical_help            26216 non-null  int64 
 9   medical_products        26216 non-null  int64 
 10  search_and_rescue       26216 non-null  int64 
 11  security                26216 non-null  int64 
 12  military                26216 non-null  int64 
 13  child_alone             26216 non-null  int64 
 14  water                   26216 non-null  int64 
 15  fo

In [None]:
df.genre.value_counts()

news      13054
direct    10766
social     2396
Name: genre, dtype: int64

Check values of categorization columns.

In [None]:
for col in df.columns[4:]:
    print(df[col].value_counts())
    print()

1    19906
0     6122
2      188
Name: related, dtype: int64

0    21742
1     4474
Name: request, dtype: int64

0    26098
1      118
Name: offer, dtype: int64

0    15356
1    10860
Name: aid_related, dtype: int64

0    24132
1     2084
Name: medical_help, dtype: int64

0    24903
1     1313
Name: medical_products, dtype: int64

0    25492
1      724
Name: search_and_rescue, dtype: int64

0    25745
1      471
Name: security, dtype: int64

0    25356
1      860
Name: military, dtype: int64

0    26216
Name: child_alone, dtype: int64

0    24544
1     1672
Name: water, dtype: int64

0    23293
1     2923
Name: food, dtype: int64

0    23902
1     2314
Name: shelter, dtype: int64

0    25811
1      405
Name: clothing, dtype: int64

0    25612
1      604
Name: money, dtype: int64

0    25918
1      298
Name: missing_people, dtype: int64

0    25341
1      875
Name: refugees, dtype: int64

0    25022
1     1194
Name: death, dtype: int64

0    22770
1     3446
Name: other_aid, dtype: int6

#### Multiple Classes in 'related' column

Most columns take values in {0, 1}, indicating false/true, but the 'related' column takes values in {0, 1, 2}.

I didn't find documentation that explains this, so let's investigate further.

What kind of messages fall into each category of the 'related' column?

In [None]:
for r_val in [1, 0, 2]:
    print(f"\n============= val: {r_val} =============")
    sub_df = df[df.related==r_val]
    for ind in range(80):
        print(sub_df.message.iloc[ind])


Weather update - a cold front from Cuba that could pass over Haiti
Is the Hurricane over or is it not over
Looking for someone but no name
UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
says: west side of Haiti, rest of the country today and tonight
Storm at sacred heart of jesus
Please, we need tents and water. We are in Silo, Thank you!
I am in Croix-des-Bouquets. We have health issues. They ( workers ) are in Santo 15. ( an area in Croix-des-Bouquets )
There's nothing to eat and water, we starving and thirsty.
I am in Thomassin number 32, in the area named Pyron. I would like to have some water. Thank God we are fine, but we desperately need water. Thanks
Let's do it together, need food in Delma 75, in didine area
More information on the 4636 number in order for me to participate. ( To see if I can use it )
A Comitee in Delmas 19, Rue ( street ) Janvier, Impasse Charite #2. We have about 500 people in a temporary shelter and we 

It seems that the 'related' value is 1 if the message is related to some disaster, and 0 otherwise. Messages with 'related' = 2 also include untranslated messages and miscellaneous garbage.

Let's look at the values of the other categorization columns for each of the 3 values for 'related'.

In [None]:
# mean number of other flags per row when 'related' col val = 1
df[df.related==1].loc[:, 'request':].sum(axis=1).mean()

3.16693459258515

In [None]:
# mean number of other flags per row when 'related' col val = 0
df[df.related==0].loc[:, 'request':].sum(axis=1).mean()

0.0

In [None]:
# when 'related' col val = 2, none of the other flags are turned on.
df[df.related==2].loc[:, 'request':].sum(axis=1).sum()

0

**Conclusion:** Other categorization flags are on (= 1) for a row only if that row has a value of 1 for 'related'. 
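
The same conclusion can be checked in a single step by looking for rows that break the rule. This is a minimal sketch on made-up data with the same column layout; in the notebook, `toy` would be `df` restricted to the label columns.

```python
import pandas as pd

# Toy data with the same layout as the label columns (values are made up).
toy = pd.DataFrame({
    "related": [1, 1, 0, 2],
    "request": [1, 0, 0, 0],
    "food":    [1, 1, 0, 0],
})

# Number of other flags set in each row.
n_other_flags = toy.loc[:, "request":].sum(axis=1)

# Rows that break the rule: some flag is on although 'related' is not 1.
violations = toy[(toy["related"] != 1) & (n_other_flags > 0)]
print(len(violations))
```

An empty `violations` frame means no row has a flag set without `related == 1`.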

#### Prevalence of URLs

In [None]:
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

In [None]:
re.findall(url_regex, "Some text https://www.Udacity.coM/ more text.")

['https://www.Udacity.coM/']

In [None]:
re.findall(url_regex, 'no url here')

[]

In [None]:
# count number of messages with url
n_url_msg = 0
for msg in df.message:
    if re.search(url_regex, msg):
        n_url_msg += 1
n_url_msg

669

In [None]:
# count messages with URLs among related messages (related == 1) and among all others
is_related = df.related == 1
n_url_related = 0
for msg in df[is_related].message:
    if re.search(url_regex, msg):
        n_url_related += 1
        
n_url_unrelated = 0
for msg in df[~is_related].message:
    if re.search(url_regex, msg):
        n_url_unrelated += 1
        
print(f"n_url_related={n_url_related}, n_url_unrelated={n_url_unrelated}")
print(f"{100*n_url_related/df[is_related].shape[0]}%, {100*n_url_unrelated/df[~is_related].shape[0]}%")

n_url_related=552, n_url_unrelated=117
2.7730332563046316%, 1.8541996830427891%


URLs are rare in both groups: about 2.8% of related messages and about 1.9% of the other messages contain one.
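
The two counting loops above follow the same pattern, so they could be folded into one helper. The sketch below is one way to do that with the standard library only; `url_share` and the sample messages are illustrative, not part of the notebook.

```python
import re

URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def url_share(messages):
    """Return the percentage of messages that contain at least one URL."""
    messages = list(messages)
    if not messages:
        return 0.0
    n_with_url = sum(1 for msg in messages if re.search(URL_REGEX, msg))
    return 100 * n_with_url / len(messages)


sample = [
    "upload your resume at http://www.jobpaw.com/",
    "Looking for someone but no name",
    "Is the Hurricane over or is it not over",
    "see https://www.Udacity.coM/ for details",
]
print(url_share(sample))
```

In the notebook this would be called as `url_share(df[is_related].message)` and `url_share(df[~is_related].message)`.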

In [None]:
%%time
# collect every url occurrence across all messages (duplicates included)
urls = []
for msg in df.message:
    urls.extend(re.findall(url_regex, msg))

Wall time: 70 ms


In [None]:
len(urls)

810

In [None]:
len(set(urls))

765

In [None]:
url_counts = pd.Series(urls).value_counts()
url_counts.head(24)

http://twitpic.com/16esd9                                                                                        8
http://bit.ly/a7zy8s                                                                                             7
http://twitpic.com/18deq7                                                                                        6
http://twitpic.com/16ad2g                                                                                        6
http://twitpic.com/15wu5u                                                                                        5
http://bit.ly/cs8BsY                                                                                             3
http://blip.fm/                                                                                                  3
http://bit.ly/a8pajh                                                                                             3
http://172.16.3.136/mymain2.php                                                 

### Split Data into Model Input and Target

#### Handle 'related' column values

In [None]:
# encode all non-related entries as 0, so we have 1 for related, 0 for non-related
df_2 = df.copy()
df_2.related = df_2.related.replace(2, 0)

In [None]:
df_2.related.value_counts()

1    19906
0     6310
Name: related, dtype: int64

#### Split Data

In [None]:
# Split into model input and categories
X = df_2['message']
Y = df_2.loc[:, 'related':]

In [None]:
X.head(3)

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
Name: message, dtype: object

In [None]:
print(Y.shape)
Y.head(3)

(26216, 36)


Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [None]:
stop_words = set(stopwords.words("english"))  # a set makes the membership test in tokenize() fast
lemmatizer = WordNetLemmatizer()
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

In [None]:
def tokenize(text):
    # convert urls to url token
    text = re.sub(url_regex, 'zzurl', text)
    
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return tokens
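
The order of the first two steps matters: URLs are replaced before punctuation is stripped, and the placeholder `zzurl` is purely alphanumeric, so it survives as a single token. The sketch below illustrates this with the standard library only; `normalize` is a hypothetical helper, and `str.split()` stands in for `word_tokenize`.

```python
import re

URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def normalize(text, placeholder="zzurl"):
    """Replace URLs with a placeholder, then lowercase and strip punctuation."""
    text = re.sub(URL_REGEX, placeholder, text)
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # str.split() stands in for nltk's word_tokenize in this sketch.
    return text.split()


msg = "Nokia.com http://ea.mobile.nokia.com/ea/graphics"
print(normalize(msg))                       # placeholder kept as one token
print(normalize(msg, placeholder="<URL>"))  # brackets are stripped, leaving 'url'
```

A placeholder such as `<URL>` would be reduced to the ordinary word `url`, which can also occur in message text, so the model could no longer tell the two apart.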

In [None]:
# examine tokenize() behavior with some sample strings
for msg in X.iloc[:12]:
    print(msg, '\n    ', tokenize(msg), '\n')

Weather update - a cold front from Cuba that could pass over Haiti 
     ['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti'] 

Is the Hurricane over or is it not over 
     ['hurricane'] 

Looking for someone but no name 
     ['looking', 'someone', 'name'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately. 
     ['un', 'report', 'leogane', '80', '90', 'destroyed', 'hospital', 'st', 'croix', 'functioning', 'need', 'supply', 'desperately'] 

says: west side of Haiti, rest of the country today and tonight 
     ['say', 'west', 'side', 'haiti', 'rest', 'country', 'today', 'tonight'] 

Information about the National Palace- 
     ['information', 'national', 'palace'] 

Storm at sacred heart of jesus 
     ['storm', 'sacred', 'heart', 'jesus'] 

Please, we need tents and water. We are in Silo, Thank you! 
     ['please', 'need', 'tent', 'water', 'silo', 'thank'] 

I would like to receive the messages, thank you 
     ['

In [None]:
# Check behavior for strings with URLs
for msg in X.iloc[:10000]:
    if re.search(url_regex, msg):
        print(msg, '\n    ', tokenize(msg), '\n')

If you want to find a Job at an NGO or the Government, upload your resume at http://www.jobpaw.com/  
     ['want', 'find', 'job', 'ngo', 'government', 'upload', 'resume', 'zzurl'] 

NOTES: WHAT A JERK ,ALL HAITIANS DONT HAVE ANYTHING TO EAT ,AND ''HE'' ORDERING 3 DAYS WITHOUT FOOD LIKE SUPPORT FOR THOSE WITHOUT FOOD? http://welcome.topuertorico.org/government.shtml 
     ['note', 'jerk', 'haitian', 'dont', 'anything', 'eat', 'ordering', '3', 'day', 'without', 'food', 'like', 'support', 'without', 'food', 'zzurl'] 

http://wap.sina.comhttp://wap.sina.com  
     ['zzurl'] 

Nokia.com http://ea.mobile.nokia.com/ea/graphics  
     ['nokia', 'com', 'zzurl'] 

BEGIN:VBKM VERSION:1.0 TITLE:Digicel Live Ha URL:http://172.16.3.136/mymain2.php BEGIN:ENV X-IRMC-URLQUOTED-PRINTABLE: InternetShortcut  URLhttp://172.16.3.136/mymain2.php END:ENV END:VBKM   
     ['begin', 'vbkm', 'version', '1', '0', 'title', 'digicel', 'live', 'ha', 'url', 'zzurl', 'begin', 'env', 'x', 'irmc', 'urlquoted', 'printab

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

#### Pipeline

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion

In [None]:
pipeline = Pipeline([
    ('text_features', TfidfVectorizer(tokenizer=tokenize)),
    ('rfc', MultiOutputClassifier(RandomForestClassifier(), n_jobs=2))
])

In [None]:
# View available pipeline params
pipeline.get_params()

{'memory': None,
 'steps': [('text_features',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.float64'>, encoding='utf-8',
                   input='content', lowercase=True, max_df=1.0, max_features=None,
                   min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                   smooth_idf=True, stop_words=None, strip_accents=None,
                   sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x00000239E63684C8>,
                   use_idf=True, vocabulary=None)),
  ('rfc',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                          ccp_alpha=0.0,
                                                          class_weight=None,
                                                          criterion='gini',
                                                          max_depth=No

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
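
`train_test_split` is called here without `random_state`, so each run draws a different split. That is why the class supports in the two sets of reports in this notebook differ. By default it holds out 25% of the rows, and 0.25 × 26216 = 6554 matches the support shown in the reports. The sketch below illustrates a seeded split with the standard library only. `seeded_split` is a hypothetical helper, not scikit-learn's implementation. In the notebook the equivalent would be passing `random_state` to `train_test_split`.

```python
import math
import random


def seeded_split(n_rows, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) from a shuffle that depends only on the seed."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = seeded_split(26216, seed=42)
print(len(train_idx), len(test_idx))

# The same seed reproduces the same split.
print(seeded_split(26216, seed=42) == (train_idx, test_idx))
```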

Train pipeline with chosen parameters.

In [None]:
%%script echo skipping cell
pipeline.set_params(text_features__max_df=0.8,
                    text_features__min_df=2.0/10000,
                    text_features__max_features=10000,
                    rfc__estimator__n_estimators=100,
                    rfc__estimator__min_samples_split=3).fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('text_features',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.8, max_features=10000,
                                 min_df=0.0002, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 t...
                                                                        ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                  

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
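
To make the report columns concrete, the sketch below computes precision, recall, and F1 for one class from scratch. `binary_prf` is a hypothetical helper that treats undefined ratios as 0.0; it is not scikit-learn's code. It also suggests why a rare label such as `offer` can show 0.00 for class 1: if the model never predicts the class, there are no true positives and precision is undefined.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for one class; undefined ratios become 0.0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# A rare label that the model never predicts (made-up data).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0] * 10

print(binary_prf(y_true, y_pred))              # class 1: (0.0, 0.0, 0.0)
print(binary_prf(y_true, y_pred, positive=0))  # class 0 still scores well
```

Accuracy on this toy label is 0.9 even though the positive class is never found, which is why the per-class rows are more informative than the accuracy row for imbalanced labels.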

In [None]:
%%script echo skipping cell
predicted_test = pipeline.predict(X_test)

In [None]:
%%script echo skipping cell
col_names = Y_test.columns

In [None]:
%%script echo skipping cell
for y_true, y_pred, colname in zip(Y_test.values.T, predicted_test.T, col_names):
    print(colname)
    print('='*len(colname))
    print(classification_report(y_true, y_pred))
    print()

related
              precision    recall  f1-score   support

           0       0.67      0.48      0.56      1544
           1       0.85      0.93      0.89      5010

    accuracy                           0.82      6554
   macro avg       0.76      0.70      0.72      6554
weighted avg       0.81      0.82      0.81      6554


request
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      5456
           1       0.83      0.50      0.63      1098

    accuracy                           0.90      6554
   macro avg       0.87      0.74      0.78      6554
weighted avg       0.89      0.90      0.89      6554


offer
=====
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6530
           1       0.00      0.00      0.00        24

    accuracy                           1.00      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg       0.99      1.00      0.99 

  _warn_prf(average, modifier, msg_start, len(result))




food
====
              precision    recall  f1-score   support

           0       0.97      0.98      0.97      5833
           1       0.80      0.72      0.76       721

    accuracy                           0.95      6554
   macro avg       0.88      0.85      0.87      6554
weighted avg       0.95      0.95      0.95      6554


shelter
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      6004
           1       0.81      0.48      0.60       550

    accuracy                           0.95      6554
   macro avg       0.88      0.73      0.79      6554
weighted avg       0.94      0.95      0.94      6554


clothing
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6451
           1       0.84      0.20      0.33       103

    accuracy                           0.99      6554
   macro avg       0.91      0.60      0.66      6554
weighted avg       0.99      0.99      0.98



direct_report
              precision    recall  f1-score   support

           0       0.86      0.98      0.92      5257
           1       0.81      0.37      0.50      1297

    accuracy                           0.86      6554
   macro avg       0.84      0.67      0.71      6554
weighted avg       0.85      0.86      0.84      6554




### 6. Improve your model
Use grid search to find better parameters. 

#### Explore values for ngram and max_features

In [None]:
# Parameters for grid search
cv_params = {
    'text_features__ngram_range': ((1, 1), (1, 2)),
    'text_features__max_df': (0.8,), #(0.5, 0.75, 1.0),
    'text_features__max_features': (10000, None), #(5000, 10000, None),
    'text_features__use_idf': (True,), # (True, False),
    'rfc__estimator__n_estimators': [100], # [50, 100, 200],
    'rfc__estimator__min_samples_split': [3], # [2, 3, 4]
}

cv = GridSearchCV(pipeline, param_grid=cv_params)
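
Before starting a long search it helps to count how many model fits the grid implies. The sketch below does this for the grid above. The fold count of 5 is an assumption about `GridSearchCV`'s default when `cv` is not given, which is consistent with the five split columns in the results.

```python
from itertools import product

cv_params = {
    'text_features__ngram_range': ((1, 1), (1, 2)),
    'text_features__max_df': (0.8,),
    'text_features__max_features': (10000, None),
    'text_features__use_idf': (True,),
    'rfc__estimator__n_estimators': [100],
    'rfc__estimator__min_samples_split': [3],
}

# Every combination of one value per parameter.
combinations = list(product(*cv_params.values()))

n_folds = 5  # assumption: the default number of folds when cv is not specified
n_fits = len(combinations) * n_folds
print(len(combinations), n_fits)
```

With the default `refit=True`, one more fit on the full training set follows the search.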

In [None]:
%%script echo skipping cell
%%time
cv.fit(X_train, Y_train)

Wall time: 1h 1min 47s


GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('text_features',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=0.8,
                                                        max_features=10000,
                                                        min_df=0.0002,
                                                        ngram_range=(1, 1),
                                        

In [None]:
%%script echo skipping cell
cv.best_params_

{'rfc__estimator__min_samples_split': 3,
 'rfc__estimator__n_estimators': 100,
 'text_features__max_df': 0.8,
 'text_features__max_features': None,
 'text_features__ngram_range': (1, 2),
 'text_features__use_idf': True}

In [None]:
%%script echo skipping cell
pd.DataFrame(cv.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rfc__estimator__min_samples_split,param_rfc__estimator__n_estimators,param_text_features__max_df,param_text_features__max_features,param_text_features__ngram_range,param_text_features__use_idf,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,149.0898,2.333451,11.227994,0.461426,3,100,0.8,10000.0,"(1, 1)",True,"{'rfc__estimator__min_samples_split': 3, 'rfc_...",0.290364,0.289855,0.277467,0.28179,0.26882,0.281659,0.008067,4
1,161.480633,1.983064,11.380506,0.100678,3,100,0.8,10000.0,"(1, 2)",True,"{'rfc__estimator__min_samples_split': 3, 'rfc_...",0.2985,0.293415,0.28001,0.286368,0.276195,0.286898,0.008238,2
2,147.773228,1.111228,10.908775,0.320882,3,100,0.8,,"(1, 1)",True,"{'rfc__estimator__min_samples_split': 3, 'rfc_...",0.29494,0.291889,0.274161,0.288657,0.263733,0.282676,0.011852,3
3,181.596993,11.550008,13.276466,2.423728,3,100,0.8,,"(1, 2)",True,"{'rfc__estimator__min_samples_split': 3, 'rfc_...",0.299263,0.302059,0.284842,0.293489,0.277213,0.291373,0.00921,1
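
The `mean_test_score` values near 0.28 look low next to the per-label accuracies of roughly 0.8 to 1.0 reported in step 5. A likely explanation is that, with no `scoring` argument, the search uses the estimator's own `score` method, which for multi-output classifiers in scikit-learn is subset accuracy: a row counts as correct only if all 36 labels match. The sketch below shows the difference on made-up predictions; both helper functions are hypothetical.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows where every label is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Accuracy of each label column, computed independently."""
    n_rows = len(y_true)
    n_labels = len(y_true[0])
    return [
        sum(t[j] == p[j] for t, p in zip(y_true, y_pred)) / n_rows
        for j in range(n_labels)
    ]


# Made-up labels: 4 rows, 3 labels each.
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0]]

print(subset_accuracy(y_true, y_pred))
print(per_label_accuracy(y_true, y_pred))
```

If a per-label metric is preferred for model selection, a custom scorer could be passed through the `scoring` argument of `GridSearchCV`.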


#### Explore parameter values for max_df

In [None]:
cv_params = {
    'text_features__ngram_range': ((1, 2),),
    'text_features__max_df': (0.8, 0.9, 1.0), #(0.5, 0.75, 1.0),
    'text_features__max_features': (None,), #(5000, 10000, None),
    'rfc__estimator__n_estimators': [100], # [50, 100, 200],
    'rfc__estimator__min_samples_split': [3] # [2, 3, 4]
}

cv = GridSearchCV(pipeline, param_grid=cv_params, cv=3)

In [None]:
%%script echo skipping cell
%%time
cv.fit(X_train, Y_train)

Wall time: 30min 55s


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('text_features',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=0.8,
                                                        max_features=10000,
                                                        min_df=0.0002,
                                                        ngram_range=(1, 1),
                                           

In [None]:
%%script echo skipping cell
cv.best_params_

{'rfc__estimator__min_samples_split': 3,
 'rfc__estimator__n_estimators': 100,
 'text_features__max_df': 1.0,
 'text_features__max_features': None,
 'text_features__ngram_range': (1, 2)}

In [None]:
%%script echo skipping cell
pd.DataFrame(cv.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rfc__estimator__min_samples_split,param_rfc__estimator__n_estimators,param_text_features__max_df,param_text_features__max_features,param_text_features__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,144.819614,6.677293,16.512791,0.834994,3,100,0.8,,"(1, 2)","{'rfc__estimator__min_samples_split': 3, 'rfc_...",0.287611,0.285627,0.276625,0.283288,0.00478,3
1,156.813878,13.763455,16.579252,0.6979,3,100,0.9,,"(1, 2)","{'rfc__estimator__min_samples_split': 3, 'rfc_...",0.288526,0.29051,0.276167,0.285068,0.006345,2
2,157.639427,4.676143,17.892474,1.905811,3,100,1.0,,"(1, 2)","{'rfc__estimator__min_samples_split': 3, 'rfc_...",0.289136,0.287305,0.278761,0.285068,0.004522,1


#### Explore parameter values for random forest

In [None]:
# Parameters for grid search
cv_params = {
    'text_features__ngram_range': ((1, 2),),
    'rfc__estimator__n_estimators': [100, 150], # [50, 100, 200],
    'rfc__estimator__min_samples_split': [3, 4], # [2, 3, 4]
}

cv = GridSearchCV(pipeline, param_grid=cv_params, cv=3)

In [None]:
%%script echo skipping cell
%%time
cv.fit(X_train, Y_train)

Wall time: 1h 51min 59s


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('text_features',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                 

In [None]:
%%script echo skipping cell
pd.DataFrame(cv.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rfc__estimator__min_samples_split,param_rfc__estimator__n_estimators,param_text_features__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,351.798884,8.290344,20.927846,0.190404,3,100,"(1, 2)","{'rfc__estimator__min_samples_split': 3, 'rfc_...",0.251755,0.256332,0.26747,0.258519,0.0066,3
1,524.421279,10.989128,29.991192,0.316006,3,150,"(1, 2)","{'rfc__estimator__min_samples_split': 3, 'rfc_...",0.2574,0.263656,0.268843,0.2633,0.004679,1
2,320.557537,6.102964,20.622887,0.190247,4,100,"(1, 2)","{'rfc__estimator__min_samples_split': 4, 'rfc_...",0.253128,0.257553,0.26274,0.257807,0.003928,4
3,529.091756,38.520307,45.809671,22.776831,4,150,"(1, 2)","{'rfc__estimator__min_samples_split': 4, 'rfc_...",0.252212,0.266555,0.27098,0.263249,0.00801,2


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [None]:
def multioutput_classification_report(trained_model, x_test, y_test):
    """Print a classification report for each of the outputs of a multioutput classification model.
    
    Args:
     - trained_model: has a predict() method
     - x_test: model inputs from the test set
     - y_test: model outputs from the test set (a dataframe with columns naming the outputs)
    """
    predicted = trained_model.predict(x_test)
    col_names = y_test.columns
    
    for y_true, y_pred, colname in zip(y_test.values.T, predicted.T, col_names):
        print(colname)
        print('='*len(colname))
        print(classification_report(y_true, y_pred))
        print()

In [None]:
%%script echo skipping cell
multioutput_classification_report(cv, X_test, Y_test)

related
              precision    recall  f1-score   support

           0       0.69      0.46      0.56      1603
           1       0.84      0.93      0.89      4951

    accuracy                           0.82      6554
   macro avg       0.77      0.70      0.72      6554
weighted avg       0.81      0.82      0.81      6554


request
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      5470
           1       0.81      0.50      0.62      1084

    accuracy                           0.90      6554
   macro avg       0.86      0.74      0.78      6554
weighted avg       0.89      0.90      0.89      6554


offer
=====
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6525
           1       0.00      0.00      0.00        29

    accuracy                           1.00      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg       0.99      1.00      0.99 

  _warn_prf(average, modifier, msg_start, len(result))




missing_people
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6468
           1       1.00      0.01      0.02        86

    accuracy                           0.99      6554
   macro avg       0.99      0.51      0.51      6554
weighted avg       0.99      0.99      0.98      6554


refugees
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      6327
           1       0.91      0.04      0.08       227

    accuracy                           0.97      6554
   macro avg       0.94      0.52      0.53      6554
weighted avg       0.96      0.97      0.95      6554


death
=====
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      6270
           1       0.87      0.19      0.31       284

    accuracy                           0.96      6554
   macro avg       0.92      0.59      0.65      6554
weighted avg       0.96      0.96 

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file
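
A minimal sketch of the export step, using the standard-library `pickle` module. The dictionary is a stand-in for the fitted model (in the notebook it would be `cv` or `pipeline`), and the file name `classifier.pkl` is an assumption.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; in the notebook this would be `cv` or `pipeline`.
model = {"name": "placeholder-model", "params": {"n_estimators": 100}}

# Hypothetical output location inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

# Write the model to disk.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Read it back to confirm the round trip.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Pickle stores functions by reference, so the `tokenize` function passed to `TfidfVectorizer` must be importable under the same name wherever the saved model is loaded.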

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.