# Disaster Response Project

- Steps 1-8 clean the data to suit our requirements: the two datasets are merged, the categories are encoded as separate 0/1 columns, and unusable data is removed.
- Steps 9-17 form the second part of the project, which builds an ML pipeline.

### 1. Import libraries and load datasets.
- Import Python libraries
- Load `data/disaster_messages.csv` into a dataframe and inspect the first few lines.
- Load `data/disaster_categories.csv` into a dataframe and inspect the first few lines.

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

import re
import pickle
import nltk

nltk.download('punkt')
nltk.download('stopwords')


# Libraries used in the second part of the program i.e. pipelining
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, make_scorer
from sklearn.svm import SVC

import warnings

warnings.simplefilter('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\naman\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\naman\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [2]:
# load messages dataset
messages = pd.read_csv('data/disaster_messages.csv')
messages.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [3]:
# load categories dataset
categories = pd.read_csv('data/disaster_categories.csv')
categories.head()

Unnamed: 0,id,categories
0,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,related-1;request-0;offer-0;aid_related-0;medi...


### 2. Merge both datasets.
- Merge the messages and categories datasets using the common id
- Assign this combined dataset to `df`, which will be cleaned

In [4]:
# merge datasets
df = messages.merge(categories, how = 'left', on = ['id'])
df.head()

Unnamed: 0,id,message,original,genre,categories
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,related-1;request-0;offer-0;aid_related-0;medi...
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,related-1;request-0;offer-0;aid_related-1;medi...
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,related-1;request-0;offer-0;aid_related-0;medi...
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,related-1;request-1;offer-0;aid_related-1;medi...
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,related-1;request-0;offer-0;aid_related-0;medi...


### 3. Split `categories` column into separate category columns.

- Split the values in the `categories` column on the `;` character so that each value becomes a separate column. You'll find [this method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.str.split.html) very helpful! Make sure to set `expand=True`.
- Use the first row of categories dataframe to create column names for the categories data.
- Rename columns of `categories` with new column names.

In [5]:
# create a dataframe of the 36 individual category columns
categories = df['categories'].str.split(';', expand = True)
categories.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [6]:
# select the first row of the categories dataframe
row = categories.iloc[0]

# use this row to extract a list of new column names for categories.
# one way is to apply a lambda function that takes everything 
# up to the second to last character of each string with slicing
category_colnames = row.transform(lambda x: x[:-2]).tolist()
print(category_colnames)

['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']


In [7]:
# rename the columns of `categories`
categories.columns = category_colnames
categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


### 4. Convert category values to just numbers 0 or 1.

- Iterate through the category columns in df to keep only the last character of each string (the 1 or 0). For example, `related-0` becomes `0`, `related-1` becomes `1`. Convert the string to a numeric value.

In [8]:
for column in categories:
    # set each value to be the last character of the string
    categories[column] = categories[column].transform(lambda x: x[-1:])
    
    # convert column from string to numeric
    categories[column] = pd.to_numeric(categories[column])
categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Confirm values in each column of the categories dataframe are, in fact, only 0's and 1's.

for column in categories:
    if len(np.unique(categories[column])) != 2:
        print(column, np.unique(categories[column]))

related [0 1 2]
child_alone [0]


In [10]:
# Count number of entries where related == 2
categories['related'].value_counts()

1    20042
0     6140
2      204
Name: related, dtype: int64

There are 204 cases where the message has a "related" value of 2. We explore what these messages look like, and decide what to do with them, in step 7.

All values in the `child_alone` column are equal to 0. With no positive examples in this data, the best model we could devise for this category would simply predict 0 (no child alone) for every message. Such a model is pointless, so we drop this column from the `categories` dataframe.
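Single-valued columns like this one can also be detected generically rather than by name. The sketch below is only an illustration: the `toy` frame and its values are made up and merely stand in for `categories`. It lists every column with a single distinct value and drops them.

```python
import pandas as pd

# Hypothetical toy frame standing in for `categories` (values are made up)
toy = pd.DataFrame({
    "related": [1, 0, 1, 1],
    "child_alone": [0, 0, 0, 0],  # constant column: nothing to learn from
    "water": [0, 1, 0, 0],
})

# A column with only one distinct value cannot separate any messages
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]

# Keep only the columns that contain both classes
informative = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(informative.columns))
```

Checking by `nunique` rather than by name means the same code keeps working if a future version of the data contains other empty categories.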

In [11]:
# Drop child_alone from categories dataframe.
categories.drop('child_alone', axis = 1, inplace = True)

categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 5. Replace `categories` column in `df` with new category columns.
- Drop the categories column from the df dataframe since it is no longer needed.
- Concatenate df and categories data frames.

In [12]:
# drop the original categories column from `df`
df.drop('categories', axis = 1, inplace = True)

df.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [13]:
# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df, categories], axis = 1)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 6. Remove duplicates.
- Check how many duplicates are in this dataset.
- Drop the duplicates.
- Confirm duplicates were removed.

In [14]:
# check number of duplicates
print('There are', sum(df.duplicated()), 'duplicates in the dataset.')

There are 170 duplicates in the dataset.


In [15]:
# drop duplicates
df.drop_duplicates(inplace = True)

# check number of duplicates
print('There are', sum(df.duplicated()), 'duplicates in the dataset.')

There are 0 duplicates in the dataset.


### 7. Explore messages with a "related" value of 2.

In [16]:
# Get messages with a related value of 2
df.loc[df['related'] == 2, 'message']

117      Dans la zone de Saint Etienne la route de Jacm...
221      . .. i with limited means. Certain patients co...
307      The internet caf Net@le that's by the Dal road...
462      Bonsoir, on est a bon repos aprs la compagnie ...
578      URGENT CRECHE ORPHANAGE KAY TOUT TIMOUN CROIX ...
657      elle est vraiment malade et a besoin d'aide. u...
889      no authority has passed by to see us. We don't...
903      It's Over in Gressier. The population in the a...
931      we sleep with the baby. Thanks in advance for ...
937      I need help in Jrmie because I was in Port-au-...
939      fsa pou mwen s v p map mouri mwen gen tout po ...
1234     .. Gonaives, in a place called Canal Bois in F...
1256     GEN YON TIBEBE KI MALAD NAN KOU PASKE BLOK TON...
1317     don't understand the first part. .. understand...
1409     Est-ce que ya monde qui aller U. s. a et qui o...
1506     Gens ont information qui dit que si on moins c...
1694         This is my address : Cersine 8 Prolong. .. 

Many of the messages with a related value of 2 appear to be in languages other than English. Based on this, we assume that a related value of 2 marks uncertainty as to whether or not the message is related. Records with an uncertain label are of little use for building an ML model, so we drop them from the dataset.

In [17]:
# Remove rows with a related value of 2 from the dataset
df = df[df['related'] != 2]

### 8. Save the clean dataset into an sqlite database.
You can do this with the pandas [`to_sql` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) combined with the SQLAlchemy library.

In [18]:
engine = create_engine('sqlite:///Messages.db')
df.to_sql('Messages', engine, index=False, if_exists='replace')
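
One way to confirm that a write like this succeeded is to read the table back and compare it with the original frame. The following is a minimal sketch under stated assumptions: it uses an in-memory SQLite database opened with the standard-library `sqlite3` module instead of the SQLAlchemy engine above, and the `toy` frame with its two rows is made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical two-row frame standing in for the cleaned `df`
toy = pd.DataFrame({
    "id": [2, 7],
    "message": ["first message", "second message"],
    "related": [1, 0],
})

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
toy.to_sql("Messages", conn, index=False, if_exists="replace")

# Read the table back and compare it with what was written
restored = pd.read_sql("SELECT * FROM Messages", conn)
conn.close()

print(restored.shape == toy.shape)
```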

## Pipelining The Data
The project is divided into two parts:
- The first part dealt with cleaning the data.
- The second part deals with building the pipeline and training the model.

### 9. Load data from database.
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [19]:
# load data from database
engine = create_engine('sqlite:///Messages.db')
df = pd.read_sql_table('Messages', engine)

X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis = 1)

### 10. Write a tokenization function to process your text data

In [20]:
def tokenize(text):
    """Normalize, tokenize and stem text string
    
    Args:
    text: string. String containing message for processing
       
    Returns:
    stemmed: list of strings. List containing normalized and stemmed word tokens
    """
    # Convert text to lowercase and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize words
    tokens = word_tokenize(text)   # using the inbuilt functions of nltk library
    
    # Stem word tokens and remove stop words
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))  # a set gives fast membership tests
    
    stemmed = [stemmer.stem(word) for word in tokens if word not in stop_words]
    
    return stemmed

### 11. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
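
To see what `MultiOutputClassifier` does in isolation, the sketch below fits it on a small random dataset. The data, the `X_toy`/`Y_toy` names and the forest settings are all made up for illustration; the point is only that the wrapper fits one copy of the base estimator per target column and returns one prediction column per target.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical toy data: 20 samples, 3 numeric features, 2 binary targets
rng = np.random.RandomState(0)
X_toy = rng.rand(20, 3)
Y_toy = rng.randint(0, 2, size=(20, 2))

# The wrapper clones the base estimator once per target column
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=5, random_state=0))
clf.fit(X_toy, Y_toy)

pred = clf.predict(X_toy)
print(pred.shape)            # one row per sample, one column per target
print(len(clf.estimators_))  # one fitted forest per target column
```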

In [21]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 12. Train pipeline
- Split data into train and test sets
- Train pipeline

In [22]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 1) # split the data into training and test set

np.random.seed(17)
pipeline.fit(X_train, Y_train) # fitting the training data on the pipeline

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip..._score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=None))])

### 13. Test your model
Report the f1 score, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 
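
As a rough illustration of the `classification_report` route, the sketch below passes a small multi-label indicator matrix directly to the function. The two label names are borrowed from the dataset, but the `y_true` and `y_pred` arrays are made up for the example.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical labels for 4 messages and 2 categories (made-up values)
label_names = ["related", "request"]
y_true = np.array([[1, 0], [1, 1], [0, 1], [1, 1]])
y_pred = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

# Text report: one row per category plus averaged rows
print(classification_report(y_true, y_pred, target_names=label_names))

# The same numbers as a nested dict, convenient for further processing
report = classification_report(
    y_true, y_pred, target_names=label_names, output_dict=True
)
print(report["related"]["recall"])
```

In this toy data each category has two true positives, no false positives and one false negative, so precision is 1.0 and recall is 2/3 for both.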

In [23]:
def get_eval_metrics(actual, predicted, col_names):
    """Calculate evaluation metrics for ML model
    
    Args:
    actual: array. Array containing actual labels.
    predicted: array. Array containing predicted labels.
    col_names: list of strings. List containing names for each of the predicted fields.
       
    Returns:
    metrics_df: dataframe. Dataframe containing the accuracy, precision, recall 
    and f1 score for a given set of actual and predicted labels.
    """
    metrics = []
    
    # Calculate evaluation metrics for each set of labels
    for i in range(len(col_names)):
        accuracy = accuracy_score(actual[:, i], predicted[:, i])
        precision = precision_score(actual[:, i], predicted[:, i])
        recall = recall_score(actual[:, i], predicted[:, i])
        f1 = f1_score(actual[:, i], predicted[:, i])
        
        metrics.append([accuracy, precision, recall, f1])
    
    # Create dataframe containing metrics
    metrics = np.array(metrics)
    metrics_df = pd.DataFrame(data = metrics, index = col_names, columns = ['Accuracy', 'Precision', 'Recall', 'F1'])
      
    return metrics_df    

In [None]:
# Calculate evaluation metrics for training set
Y_train_pred = pipeline.predict(X_train)
col_names = list(Y.columns.values)

print(get_eval_metrics(np.array(Y_train), Y_train_pred, col_names))

                        Accuracy  Precision    Recall        F1
related                 0.990267   0.992651  0.994645  0.993647
request                 0.987603   0.997442  0.930233  0.962666
offer                   0.998617   1.000000  0.715789  0.834356
aid_related             0.984325   0.994847  0.967608  0.981039
medical_help            0.988935   0.999260  0.862620  0.925926
medical_products        0.992367   0.998817  0.850806  0.918889
search_and_rescue       0.993955   0.995624  0.796848  0.885214
security                0.995543   1.000000  0.749280  0.856672
military                0.994980   0.996317  0.849294  0.916949
water                   0.995031   0.999139  0.923567  0.959868
food                    0.995390   0.998111  0.960891  0.979147
shelter                 0.993545   0.997526  0.929683  0.962411
clothing                0.998207   1.000000  0.886364  0.939759
money                   0.995492   1.000000  0.809935  0.894988
missing_people          0.996773   1.000

In [None]:
# Calculate evaluation metrics for test set
Y_test_pred = pipeline.predict(X_test)

eval_metrics0 = get_eval_metrics(np.array(Y_test), Y_test_pred, col_names)
print(eval_metrics0)

                        Accuracy  Precision    Recall        F1
related                 0.805133   0.846543  0.909603  0.876941
request                 0.887967   0.795764  0.469643  0.590679
offer                   0.996465   0.000000  0.000000  0.000000
aid_related             0.744275   0.734845  0.592758  0.656198
medical_help            0.920240   0.500000  0.067437  0.118846
medical_products        0.953896   0.666667  0.130841  0.218750
search_and_rescue       0.976026   0.363636  0.026144  0.048780
security                0.980636   0.000000  0.000000  0.000000
military                0.966959   0.785714  0.049327  0.092827
water                   0.950515   0.809211  0.295673  0.433099
food                    0.937145   0.816901  0.560773  0.665029
shelter                 0.931919   0.772000  0.333333  0.465621
clothing                0.986476   0.846154  0.113402  0.200000
money                   0.978331   0.500000  0.042553  0.078431
missing_people          0.987245   0.000

Test accuracy is high for every category, but the F1 score is unacceptably low for most of them. This is likely due to class imbalance in the dataset, as the following shows:

In [None]:
# Calculate the proportion of rows in each column that have label == 1
Y.sum()/len(Y)

related                   0.764792
request                   0.171892
offer                     0.004534
aid_related               0.417243
medical_help              0.080068
medical_products          0.050446
search_and_rescue         0.027816
security                  0.018096
military                  0.033041
water                     0.064239
food                      0.112302
shelter                   0.088904
clothing                  0.015560
money                     0.023206
missing_people            0.011449
refugees                  0.033618
death                     0.045874
other_aid                 0.132396
infrastructure_related    0.065506
transport                 0.046143
buildings                 0.051214
electricity               0.020440
tools                     0.006109
hospitals                 0.010873
shops                     0.004610
aid_centers               0.011872
other_infrastructure      0.044222
weather_related           0.280352
floods              

In many cases, fewer than 5% of the dataset have a label of 1, making it more difficult for any model to predict these cases than if the data were balanced. 
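
The effect of imbalance on the two metrics can be shown with a made-up example: 100 labels of which 5% are positive, and a "model" that always predicts 0. The arrays are illustrative only, and the `zero_division` argument assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 5 positives out of 100 (5% positive rate)
y_true = np.array([1] * 5 + [0] * 95)

# A trivial predictor that never flags a message
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
# zero_division=0 (assumes scikit-learn >= 0.22) returns 0 instead of
# warning when there are no predicted positives
f1 = f1_score(y_true, y_pred, zero_division=0)

print(acc, f1)
```

The trivial predictor reaches 95% accuracy while its F1 score is 0, which is the same pattern seen for the rare categories in the test metrics above.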

Ideally, we would have used stratified sampling to create the train and test sets, as we would have done had there been just one column in the y dataset. However, because each datapoint has multiple labels, this is not practical here. We would effectively have to create a separate train and test set for each set of y-labels, which in turn would mean fitting a separate model to each of the y-columns. This is not something that we wish to do.
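
For reference, this is roughly what stratified sampling looks like in the single-label case. The sketch uses made-up data with a 10% positive rate and the `stratify` argument of `train_test_split`; it is not applied to the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical single-label data: 10 positives out of 100
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the class proportions equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=1
)

print(y_tr.mean(), y_te.mean())  # positive rate in train and test
```

Both splits keep the 10% positive rate, which a purely random split does not guarantee for rare labels.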

### 14. Improve your model
Use grid search to find better parameters. 

In [None]:
# Define performance metric for use in grid search scoring object
def performance_metric(y_true, y_pred):
    """Calculate median F1 score for all of the output classifiers
    
    Args:
    y_true: array. Array containing actual labels.
    y_pred: array. Array containing predicted labels.
        
    Returns:
    score: float. Median F1 score for all of the output classifiers
    """
    f1_list = []
    for i in range(np.shape(y_pred)[1]):
        f1 = f1_score(np.array(y_true)[:, i], y_pred[:, i])
        f1_list.append(f1)
        
    score = np.median(f1_list)
    return score

We have chosen to use the median F1 score for all of the output classifiers, rather than the mean, to avoid the situation where we are selecting a set of parameters that result in a small number of the output classifiers having very high test F1 scores, but the majority of the output classifiers having test F1 scores close to zero.
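
A small numeric illustration of this choice, using two made-up sets of per-category F1 scores: set A has two strong categories and three at zero, set B is mediocre but consistent.

```python
import numpy as np

# Hypothetical per-category F1 scores for two parameter settings
f1_a = [0.90, 0.85, 0.00, 0.00, 0.00]  # a few very good, most useless
f1_b = [0.30, 0.35, 0.25, 0.40, 0.30]  # consistently modest

print(np.mean(f1_a), np.mean(f1_b))      # the mean favours setting A
print(np.median(f1_a), np.median(f1_b))  # the median favours setting B
```

Scoring by the mean would select setting A even though most of its classifiers never detect anything, whereas the median selects setting B.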

In [None]:
# Create grid search object

parameters = {'vect__min_df': [1, 5],
              'tfidf__use_idf':[True, False],
              'clf__estimator__n_estimators':[10, 25], 
              'clf__estimator__min_samples_split':[2, 5, 10]}

scorer = make_scorer(performance_metric)
cv = GridSearchCV(pipeline, param_grid = parameters, scoring = scorer, verbose = 10)

# Find best parameters
np.random.seed(81)
tuned_model = cv.fit(X_train, Y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1, score=0.12432432432432435, total= 1.4min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.9min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1, score=0.10619469026548672, total= 1.3min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.6min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=1, score=0.10071942446043167, total= 1.3min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  5.3min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=5, score=0.171875, total= 1.1min
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 13.7min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=5, score=0.18770226537216828, total=  31.9s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 14.4min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=True, vect__min_df=5, score=0.16666666666666669, total=  29.1s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 15.1min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1, score=0.09999999999999999, total=  35.0s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 15.9min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1, score=0.15458937198067632, total=  34.4s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 16.7min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=1, score=0.10671256454388986, total=  33.6s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 17.4min remaining:    0.0s


[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=5, score=0.1722488038277512, total=  28.8s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=5 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=5, score=0.14864864864864866, total=  27.2s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=5 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=10, tfidf__use_idf=False, vect__min_df=5, score=0.1316872427983539, total=  27.3s
[CV] clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=2, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=1, score=0.18210361067503925, total= 1.1min
[CV] clf__estimator__min_samples_split=2, clf

[CV]  clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5, score=0.19576719576719576, total=123.4min
[CV] clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5 
[CV]  clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=True, vect__min_df=5, score=0.20108695652173914, total=  48.6s
[CV] clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1, score=0.11987381703470032, total=  52.2s
[CV] clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1 
[CV]  clf__estimator__min_samples_split=5, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=1, score=0.12209302325581395, total=  51.0s
[CV] clf__estimator__min_samples_split=5, c

[CV]  clf__estimator__min_samples_split=10, clf__estimator__n_estimators=25, tfidf__use_idf=False, vect__min_df=5, score=0.1752988047808765, total=  36.8s


[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed: 197.7min finished


In [None]:
# Get results of grid search
tuned_model.cv_results_

{'mean_fit_time': array([  63.61591729,   33.95219954,   28.21514964,   22.01785533,
          54.70001976,   41.17596547,   53.96547127,   43.25740846,
          23.73056022,   23.22256867,   24.94887368,   21.2991364 ,
          45.52840304, 2490.4673032 ,   43.29930361,   36.83892083,
          22.34832247,   20.3313597 ,   21.70344218,   20.05396962,
          39.68346953,   33.31727378,   38.53081473,   31.73518403]),
 'std_fit_time': array([2.67827825e+00, 1.39533099e+01, 7.37033904e-01, 4.44593580e-01,
        1.76675448e+00, 8.52873913e-01, 3.60720715e+00, 5.68523530e-01,
        2.03496680e-01, 4.93566517e-01, 1.44358752e+00, 1.05161333e+00,
        1.65318129e+00, 3.46798912e+03, 2.93365149e-01, 1.68384930e+00,
        5.13420757e-01, 1.83620874e+00, 1.10897232e+00, 1.89836108e-01,
        1.65170680e+00, 6.20059993e-01, 8.37052190e-01, 1.12268558e+00]),
 'mean_score_time': array([15.004016  ,  8.80250963,  6.21320566,  5.84154232,  8.04844904,
         7.22265601,  8.2390620

In [None]:
# Best mean test score
np.max(tuned_model.cv_results_['mean_test_score'])

0.20679501377175796

In [None]:
# Parameters for best mean test score
tuned_model.best_params_

{'clf__estimator__min_samples_split': 10,
 'clf__estimator__n_estimators': 10,
 'tfidf__use_idf': True,
 'vect__min_df': 5}

The best results (with regard to median F1 score) were achieved using the following parameters:
* CountVectorizer minimum df = 5
* TfidfTransformer use_idf = True
* Random Forest Classifier number of estimators = 10
* Random Forest Classifier minimum samples split = 10
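
To see how close the other candidates came, each parameter combination can be paired with its mean test score and sorted. The sketch below assumes only the `params` and `mean_test_score` keys that scikit-learn stores in `cv_results_`. The helper name `rank_candidates` and the toy scores are illustrative, not taken from the run above.

```python
def rank_candidates(cv_results):
    """Pair each parameter combination with its mean test score, best first."""
    pairs = zip(cv_results['params'], cv_results['mean_test_score'])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in for tuned_model.cv_results_ (toy scores, not the real ones).
toy_results = {
    'params': [
        {'vect__min_df': 1, 'tfidf__use_idf': False},
        {'vect__min_df': 5, 'tfidf__use_idf': True},
        {'vect__min_df': 5, 'tfidf__use_idf': False},
    ],
    'mean_test_score': [0.12, 0.21, 0.17],
}

ranked = rank_candidates(toy_results)
best_params, best_score = ranked[0]
print(best_params, best_score)
```

In the notebook itself, the same helper could be called as `rank_candidates(tuned_model.cv_results_)`.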

### 15. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [None]:
# Calculate evaluation metrics for test set
tuned_pred_test = tuned_model.predict(X_test)

eval_metrics1 = get_eval_metrics(np.array(Y_test), tuned_pred_test, col_names)

print(eval_metrics1)

                        Accuracy  Precision    Recall        F1
related                 0.809897   0.840577  0.926716  0.881547
request                 0.890733   0.778990  0.509821  0.616298
offer                   0.996465   0.000000  0.000000  0.000000
aid_related             0.751959   0.702087  0.690556  0.696274
medical_help            0.920547   0.505814  0.167630  0.251809
medical_products        0.957123   0.756098  0.193146  0.307692
search_and_rescue       0.978792   0.777778  0.137255  0.233333
security                0.980636   0.250000  0.008065  0.015625
military                0.967881   0.652174  0.134529  0.223048
water                   0.956201   0.822660  0.401442  0.539580
food                    0.939142   0.788732  0.618785  0.693498
shelter                 0.935915   0.745455  0.424870  0.541254
clothing                0.988167   0.777778  0.288660  0.421053
money                   0.979253   0.650000  0.092199  0.161491
missing_people          0.987245   0.000

In [None]:
# Get summary stats for first model
eval_metrics0.describe()

Unnamed: 0,Accuracy,Precision,Recall,F1
count,35.0,35.0,35.0,35.0
mean,0.9424,0.564538,0.187315,0.241425
std,0.057651,0.333674,0.243477,0.269195
min,0.744275,0.0,0.0,0.0
25%,0.931689,0.381818,0.016632,0.032458
50%,0.955586,0.733333,0.06474,0.116667
75%,0.980559,0.813056,0.324296,0.453287
max,0.996465,1.0,0.909603,0.876941


In [None]:
# Get summary stats for tuned model
eval_metrics1.describe()

Unnamed: 0,Accuracy,Precision,Recall,F1
count,35.0,35.0,35.0,35.0
mean,0.945009,0.578224,0.248886,0.311893
std,0.055767,0.301351,0.261627,0.273151
min,0.751959,0.0,0.0,0.0
25%,0.937529,0.472225,0.033616,0.064037
50%,0.957123,0.71049,0.149378,0.242424
75%,0.981712,0.783861,0.413156,0.540417
max,0.996465,1.0,0.926716,0.881547


Tuning the model parameters increased both the median and the mean (test) F1 score across the output classifiers. However, 50% of the output classifiers still have an F1 score below 0.24, and 25% have an F1 score below 0.064. This is due to low recall values (i.e. the proportion of positive points that were correctly labelled). Ideally, we would like to improve on this.
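
To make the link between low recall and low F1 concrete, here is a small sketch that recomputes F1 as the harmonic mean of precision and recall for a few rows of the tuned-model table above. The helper `f1_from_precision_recall` is an illustrative name, not part of the notebook. Because the inputs are the rounded table values, the results only agree with the table to about four decimal places.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the tuned-model table above.
rows = {
    'related': (0.840577, 0.926716),
    'medical_help': (0.505814, 0.167630),
    'security': (0.250000, 0.008065),
    'offer': (0.000000, 0.000000),
}

f1_scores = {name: f1_from_precision_recall(p, r) for name, (p, r) in rows.items()}

for name, score in f1_scores.items():
    print(f"{name:<14}{score:.4f}")
```

The harmonic mean is dominated by the smaller of the two values, so a category such as `security` ends up with an F1 close to its recall even though its precision is 0.25.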

### 16. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

To try to improve the model further, we will replace the Random Forest Classifier in the pipeline with a polynomial-kernel SVM classifier. SVMs are often used for text categorization tasks due to their “ability to process many thousand different inputs. This opens the opportunity to use all words in a text directly as features”.

To keep the number of grid search cases to a minimum, we will keep the tuned parameter values for the CountVectorizer and TfidfTransformer found in the previous section.
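
As a rough check on the size of the search, the candidates can be counted before anything is fitted. The sketch below copies the parameter grid from the next cell and enumerates it with `itertools.product`. The fold count of 3 is taken from the grid search log rather than from the code, so treat it as an assumption.

```python
from itertools import product

# Same grid as in the next cell; the first three entries are fixed to one value.
param_grid = {
    'vect__min_df': [5],
    'tfidf__use_idf': [True],
    'clf__estimator__kernel': ['poly'],
    'clf__estimator__degree': [1, 2, 3],
    'clf__estimator__C': [1, 10, 100],
}
n_folds = 3  # fold count reported in the grid search log (assumed here)

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
```

Only `degree` and `C` vary, so the grid has 3 x 3 = 9 candidates and 27 fits in total.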

In [None]:
# Try using SVM instead of Random Forest Classifier
pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer = tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(SVC()))   # one SVC is fitted per output category
])

parameters2 = {'vect__min_df': [5],
               'tfidf__use_idf': [True],
               'clf__estimator__kernel': ['poly'],
               'clf__estimator__degree': [1, 2, 3],
               'clf__estimator__C': [1, 10, 100]}

cv2 = GridSearchCV(pipeline2, param_grid = parameters2, scoring = scorer, verbose = 10)

# Find best parameters
np.random.seed(81)
tuned_model2 = cv2.fit(X_train, Y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] clf__estimator__C=1, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  clf__estimator__C=1, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 2.8min
[CV] clf__estimator__C=1, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.5min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 2.7min
[CV] clf__estimator__C=1, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  8.9min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 2.7min
[CV] clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 13.5min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 3.0min
[CV] clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 18.4min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 2.8min
[CV] clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 23.1min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=2, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 3.0min
[CV] clf__estimator__C=1, clf__estimator__degree=3, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 28.1min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=3, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 3.1min
[CV] clf__estimator__C=1, clf__estimator__degree=3, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed: 33.3min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=3, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 2.9min
[CV] clf__estimator__C=1, clf__estimator__degree=3, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed: 38.1min remaining:    0.0s


[CV]  clf__estimator__C=1, clf__estimator__degree=3, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 3.0min
[CV] clf__estimator__C=10, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed: 42.9min remaining:    0.0s


[CV]  clf__estimator__C=10, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 3.2min
[CV] clf__estimator__C=10, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 
[CV]  clf__estimator__C=10, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5, score=0.0, total= 3.5min
[CV] clf__estimator__C=10, clf__estimator__degree=1, clf__estimator__kernel=poly, tfidf__use_idf=True, vect__min_df=5 


- Note: the grid search above can take a long time to run, roughly 120-150 minutes depending on the system.
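
The running time can be estimated from the log itself. The sketch below extrapolates linearly from the "Done 9 out of 9 | elapsed: 42.9min" line to all 27 fits. It assumes that later fits take about as long as the first nine, which may not hold for larger values of `C`. The helper name `estimate_total_minutes` is illustrative.

```python
def estimate_total_minutes(elapsed_minutes, fits_done, total_fits):
    """Linear extrapolation: average minutes per finished fit times total fits."""
    return elapsed_minutes / fits_done * total_fits


# Values read from the grid search log: 9 fits finished after 42.9 minutes.
estimate = estimate_total_minutes(elapsed_minutes=42.9, fits_done=9, total_fits=27)

print(f"{estimate:.1f} min")
```

This gives about 129 minutes, within the range quoted above. The log reports `n_jobs=1`. `GridSearchCV` accepts an `n_jobs` argument, so running fits in parallel may shorten this on a multi-core machine.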

In [None]:
# Get results of grid search
tuned_model2.cv_results_

In [None]:
# Calculate evaluation metrics for test set
tuned_pred_test2 = tuned_model2.predict(X_test)

eval_metrics2 = get_eval_metrics(np.array(Y_test), tuned_pred_test2, col_names)

print(eval_metrics2)

The model performs well with regard to F1 score in one case ("related") but terribly in all other cases. We could try some more parameter values for the SVM in order to try to find a combination that will work, but instead, we shall just stick with the original tuned Random Forest Classifier based model.

### 17. Export your model as a pickle file

In [None]:
# Pickle best model
# The context manager closes the file once the model has been written
with open('disaster_model.sav', 'wb') as f:
    pickle.dump(tuned_model, f)
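
For completeness, here is a minimal sketch of the round trip: writing an object with `pickle.dump` and reading it back with `pickle.load`. A plain dictionary stands in for `tuned_model`, and the file is written to a temporary directory. Both are assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for tuned_model: any picklable object goes through the same calls.
stand_in_model = {'name': 'tuned_model', 'best_params': {'vect__min_df': 5}}

# Write to a temporary directory so the example leaves the project folder untouched.
path = os.path.join(tempfile.mkdtemp(), 'disaster_model.sav')

with open(path, 'wb') as f:
    pickle.dump(stand_in_model, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

When loading the real model, note that pickle stores functions by reference. The `tokenize` function passed to `CountVectorizer` therefore has to be defined or importable in the session that calls `pickle.load`, and it is safest to load with the same scikit-learn version that was used for training.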