# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
!pip install scikit-learn pandas nltk --upgrade

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/f5/ef/bcd79e8d59250d6e8478eb1290dc6e05be42b3be8a86e3954146adbc171a/scikit_learn-0.24.2-cp36-cp36m-manylinux1_x86_64.whl (20.0MB)

In [1]:
!pip list


Package                       Version    
----------------------------- -----------
altair                        1.2.1      
asn1crypto                    0.22.0     
atari-py                      0.1.7      
atomicwrites                  1.3.0      
attrs                         19.1.0     
audioread                     2.1.6      
av                            0.3.3      
awscli                        1.16.17    
backcall                      0.1.0      
backports.functools-lru-cache 1.4        
backports.weakref             1.0rc1     
beautifulsoup4                4.6.0      
bleach                        1.5.0      
blinker                       1.4        
bokeh                         0.12.13    
boto                          2.48.0     
boto3                         1.9.7      
botocore                      1.12.7     
Box2D                         2.3.2      
Box2D-kengz                   2.3.3      
bresenham                     0.2        
bz2file                       0.98

In [2]:
# import libraries
from joblib import dump, load

from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn import metrics

from sqlalchemy import create_engine
import pandas as pd

import numpy as np

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
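nltk.download('wordnet')  # repeated call; harmless, NLTK reports the package as up-to-date
nltk.download('averaged_perceptron_tagger')  # tagger model used by nltk.pos_tag
nltk.download('pos_tag')  # not a downloadable NLTK package (see the error in the output); pos_tag is a function, not a resource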

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Error loading pos_tag: Package 'pos_tag' not found in
[nltk_data]     index


False

In [4]:
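# use the fully qualified option name; the short form can be ambiguous in newer pandas versions
pd.set_option('display.max_columns', 50)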

## 1. Load Data
Load the data from the SQLite database (created in the ETL step).

In [5]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql('SELECT * from InsertTableName', engine)
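
The cell above loads the table with a raw `SELECT *` query. The `read_sql_table` function named in the instructions reads a table by name instead. A minimal sketch, assuming an in-memory SQLite database and a made-up table called `messages_demo` (both exist only for illustration and are not the project's database):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database; stands in for the file created by the ETL step
demo_engine = create_engine('sqlite://')

# Hypothetical two-row table with a text column and two label columns
pd.DataFrame({
    'id': [1, 2],
    'message': ['we need water', 'road is blocked'],
    'water': [1, 0],
    'transport': [0, 1],
}).to_sql('messages_demo', demo_engine, index=False)

# Load the whole table by name instead of writing a SELECT statement
demo_df = pd.read_sql_table('messages_demo', demo_engine)

# Features are the message text, targets are the remaining label columns
X_demo = demo_df['message']
Y_demo = demo_df.drop(columns=['id', 'message'])
print(Y_demo.columns.tolist())
```

`read_sql_table` needs an SQLAlchemy connectable, which is why the sketch uses `create_engine` rather than a plain `sqlite3` connection.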


In [6]:
# show what has been loaded
#
df.iloc[7500:7550,:]

Unnamed: 0,id,message,original,genre,categories,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
7500,8461,Informations requiere about of cyclon and cold...,Cyclone ou frond froid/informations?,direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0
7501,8462,What can i do about the vaginal infection.,Ki sa mwen ka f enfektyon vaginal,direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7502,8463,NOTES: Tjis message doesn't mean anything.,Bouboulemen telefone leo 384 o 68 60 oh!,direct,related-0;request-0;offer-0;aid_related-0;medi...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7503,8464,What they're going to do for the anarchy house?,ki sa y ap fe pou kay ki konstwi len sou lot yo,direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7504,8465,Haiti don't collapse stay still,HAITI PAP PERI ANNOU KENBE LA.,direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7505,8466,Send me what I can do to keep myself safe,VOYE DIM KISA POUM FE POUM KA KENBE PI DJANM.,direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7506,8467,"There is Doctor for plague,where I'll find it?...",Eske gen dokt pou ps kikote map jwenn li mesi,direct,related-1;request-1;offer-0;aid_related-1;medi...,1,1,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
7507,8468,I think the water ground could only showering.,M'pan c dlo pi a c benyen pou nou ta benyen avl,direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7508,8470,"Thank you for the counsels, unfortunately I ha...",msi pou konsey yo malerezman m genta trape mik...,direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7509,8471,United Nations told me just wait the distribut...,Nasyonzini te dim rete tann distribisxon tant ...,direct,related-1;request-1;offer-0;aid_related-1;medi...,1,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [7]:
# describe the dataset - see how many columns have a value
df.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0
mean,15226.183985,0.766427,0.170538,0.00447,0.41412,0.079462,0.050046,0.027659,0.017994,0.032816,0.0,0.063761,0.111438,0.088172,0.015434,0.023036,0.011384,0.033389,0.045538,0.131456,0.065136,0.045805,0.050848,0.020324,0.006074,0.010811,0.004584,0.011805,0.043972,0.278347,0.082098,0.093215,0.010773,0.093674,0.020171,0.052567,0.193421
std,8827.169602,0.423112,0.376112,0.066708,0.492579,0.270464,0.218044,0.163997,0.13293,0.178159,0.0,0.244331,0.31468,0.283551,0.123274,0.150022,0.106091,0.179655,0.208485,0.337905,0.24677,0.209067,0.219692,0.141109,0.077702,0.103416,0.067554,0.108008,0.205036,0.448194,0.274519,0.290739,0.103236,0.29138,0.140588,0.223172,0.394988
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7448.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15663.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22925.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### 1.1 Check distribution of labels

In [8]:
df.drop(columns=['id','message','original','categories','genre']).sum()

related                   20062
request                    4464
offer                       117
aid_related               10840
medical_help               2080
medical_products           1310
search_and_rescue           724
security                    471
military                    859
child_alone                   0
water                      1669
food                       2917
shelter                    2308
clothing                    404
money                       603
missing_people              298
refugees                    874
death                      1192
other_aid                  3441
infrastructure_related     1705
transport                  1199
buildings                  1331
electricity                 532
tools                       159
hospitals                   283
shops                       120
aid_centers                 309
other_infrastructure       1151
weather_related            7286
floods                     2149
storm                      2440
fire    

In [9]:
df.drop(columns=['id','message','original','categories','genre']).hist(figsize=(16,12), sharey=True, sharex=True);

### 1.2 Split dataset into features and labels

In [10]:
X = df['message']
Y = df.drop(columns=['message','original','categories','genre','id'])

## 2. Review dataset
### 2.1 Check for URLs

In [11]:
pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
x_urls = X.str.findall(pattern)
x_urls[[ len(x)>0 for x in x_urls]]
# alternative x_urls[x_urls.str.len().gt(0)]

5019                              [http://www.jobpaw.com/]
5288     [http://welcome.topuertorico.org/government.sh...
7328              [http://wap.sina.comhttp://wap.sina.com]
8835              [http://ea.mobile.nokia.com/ea/graphics]
9707     [http://172.16.3.136/mymain2.php, http://172.1...
                               ...                        
24965    [http://www.usaid.gov/hum_response/ofda/situat...
25214    [http://www.ausaid.gov.au/hottopics/topic.cfm?...
25345    [http://www.irinnews.org/Report/94826/SENEGAL-...
25668    [http://www.ocfa.gov.ae/En/MediaCenter/OCFANew...
26126    [http://agritrade.cta.int/Agriculture/Commodit...
Name: message, Length: 668, dtype: object

In [12]:
pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
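# reassign instead of using inplace=True: X is a column taken from df, and in-place edits on it are unreliable
X = X.replace(to_replace=pattern, value='urlplaceholder', regex=True)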
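
To see what the URL pattern does before applying it to the full column, here is a small sketch on two made-up messages (the Series `demo_messages` is an assumption for illustration; the pattern is the one used in the cells above):

```python
import pandas as pd

pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical messages: one with a URL, one without
demo_messages = pd.Series([
    'see http://www.jobpaw.com/ for jobs',
    'no link here',
])

# List of URLs found in each message (empty list when there is none)
found = demo_messages.str.findall(pattern)

# Same messages with every URL swapped for a fixed placeholder token
cleaned = demo_messages.str.replace(pattern, 'urlplaceholder', regex=True)

print(found.tolist())
print(cleaned.tolist())
```

The match stops at the first whitespace character, so the text after the URL is left untouched.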


In [13]:
pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
x_urls = X.str.findall(pattern)
x_urls_df = pd.DataFrame({'url': x_urls})
# count the URLs per message; .str.len() works on list values and avoids the deprecated applymap
x_urls_df['num_of_entries'] = x_urls_df['url'].str.len()
x_urls_df[x_urls_df['num_of_entries']>0]

Unnamed: 0,url,num_of_entries


### 2.2 Find empty rows: rows where every label is 0

In [14]:
Y[~Y.any(axis=1)]

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
17,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
18,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26160,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26164,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26169,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
26171,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### 2.2.1 Add new dummy column for empty rows
The `related` column shows whether a message is relevant for classification.

In [15]:
#Y['no_category'] = 0
#Y.loc[~Y.any(axis=1), 'no_category']=1
#Y
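
The commented-out idea above can be tried on a toy label frame first. This is only a sketch: `Y_demo` and its values are made up, and `no_category` is the column name from the commented code.

```python
import pandas as pd

# Hypothetical label frame: the second row has no label set
Y_demo = pd.DataFrame({
    'related': [1, 0, 1],
    'water': [0, 0, 1],
})

# True for rows in which every label is 0
is_empty = ~Y_demo.any(axis=1)

# Dummy column: 1 for rows without any label, 0 otherwise
Y_demo['no_category'] = is_empty.astype(int)
print(Y_demo)
```

Computing the mask before adding the column matters: once `no_category` exists, `any(axis=1)` would also see the new column.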

## 3. Write a tokenization function to process your text data

In [16]:
pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
reg = re.compile(pattern)
# the lemmatizer was used below without being created
lemmatizer = WordNetLemmatizer()

def tokenize(text, enable_lemmatizer=False):
    # replace each URL with a placeholder token
    text = reg.sub('urlplaceholder', text)
    # split the text into word tokens
    tokens = word_tokenize(text)

    # optionally lowercase, strip and lemmatize each token (stop words are not removed here)
    if enable_lemmatizer:
        tokens = [lemmatizer.lemmatize(word.lower().strip()) for word in tokens]

    return tokens

In [17]:
class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    '''Flags texts in which a sentence starts with a verb (or with the retweet marker "RT").'''

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # skip sentences without tokens; pos_tags[0] would raise an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True
        return False

    def fit(self, X, y=None):
        # nothing to learn; the transformer is stateless
        return self

    def transform(self, X):
        # one boolean per input text, returned as a single-column DataFrame
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

## 4. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
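
As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on made-up numeric data with three binary target columns (`X_toy` and `Y_toy` are assumptions for illustration, not the message data). It trains one copy of the wrapped estimator per target column.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy data: 6 samples, 2 numeric features, 3 binary target columns
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
Y_toy = np.array([
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
])

multi_clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
)
multi_clf.fit(X_toy, Y_toy)

# One fitted classifier per target column
n_fitted = len(multi_clf.estimators_)

# Predictions come back with one column per target
toy_pred = multi_clf.predict(X_toy)
print(n_fitted, toy_pred.shape)
```

With 36 categories, the same wrapper would hold 36 independent random forests, which is worth keeping in mind for training time.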

In [18]:
len(Y.columns)

36

In [19]:
from sklearn.multioutput import MultiOutputClassifier


def build_pipeline(starting_verb=False):
    # starting_verb=False: TF-IDF features only; True: TF-IDF plus the StartingVerbExtractor feature
    if not starting_verb:
        pipeline = Pipeline([
            #('vectorizer', CountVectorizer()),
            #('tfidf', TfidfTransformer())
            ('tfidf_vect', TfidfVectorizer()),
            ('clf', MultiOutputClassifier(RandomForestClassifier()))
        ])
    else:    
        pipeline = Pipeline([
            ('features', FeatureUnion([
                ('text_pipeline', Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer())
                ])),

                ('starting_verb', StartingVerbExtractor())
            ])),

            ('clf', MultiOutputClassifier(RandomForestClassifier()))
        ])

    return pipeline








### 4.1 TfidfVectorizer Parameters

**max_df**: float or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words)

**min_df**: float or int, default=1 
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold.
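
A small sketch of how these two thresholds prune the vocabulary, using four made-up documents (the corpus `docs` is an assumption for illustration). A float is read as a fraction of documents, an integer as an absolute document count.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus
docs = [
    'water needed in the village',
    'food and water needed',
    'the road is blocked',
    'need water and shelter',
]

# Defaults (max_df=1.0, min_df=1): every term is kept
full = TfidfVectorizer().fit(docs)

# max_df=0.5 drops terms found in more than 50% of the documents ('water', 3 of 4)
# min_df=2 drops terms found in fewer than 2 documents ('village', 'road', ...)
pruned = TfidfVectorizer(max_df=0.5, min_df=2).fit(docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)
print(len(full_vocab), pruned_vocab)
```

Only the terms that occur in exactly two of the four documents survive both filters here.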

### 4.2 Train & Test  pipeline
- Split data into train and test sets
- Train pipeline

In [22]:
# split data: 10% of the rows are used for training and 30% for testing; the remaining rows are left out
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, train_size=0.1)


pipeline = build_pipeline(False)
#run the pipeline
pipeline.fit(X_train, y_train)

Pipeline(steps=[('tfidf_vect', TfidfVectorizer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])
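
Passing both `test_size` and `train_size` means the two parts do not have to add up to the full dataset. A quick sketch with 100 made-up samples (`X_demo` and `y_demo` are assumptions for illustration) shows how many rows end up unused:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical samples with a binary label
X_demo = np.arange(100)
y_demo = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, train_size=0.1, random_state=0
)

# Rows that are in neither the training nor the test set
unused = len(X_demo) - len(X_tr) - len(X_te)
print(len(X_tr), len(X_te), unused)
```

A small `train_size` shortens the fit, but the model also sees far fewer examples of the rare categories.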

In [23]:
y_pred = pipeline.predict(X_test)

## 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
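
The column-by-column approach described above could look like the following sketch. The frames `y_true_demo` and `y_pred_demo` are made-up stand-ins for `y_test` and `y_pred`, and `zero_division=0` is an optional choice that reports 0 instead of warning when a class is never predicted.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels (DataFrame) and predictions (array), two categories
y_true_demo = pd.DataFrame({'related': [1, 1, 0, 1], 'water': [0, 1, 0, 0]})
y_pred_demo = np.array([[1, 0], [1, 1], [0, 0], [0, 0]])

reports = {}
for i, column in enumerate(y_true_demo.columns):
    # Keep a dict version for later use and print the text version
    reports[column] = classification_report(
        y_true_demo[column], y_pred_demo[:, i],
        output_dict=True, zero_division=0,
    )
    print(column)
    print(classification_report(
        y_true_demo[column], y_pred_demo[:, i], zero_division=0
    ))
```

Unlike a single call on the whole label matrix, this reports precision and recall for both the 0 and the 1 class of every category.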

In [24]:
print(metrics.classification_report(y_test.reset_index(drop=True), y_pred, target_names=y_test.columns.values ))

                        precision    recall  f1-score   support

               related       0.80      0.97      0.87      6011
               request       0.89      0.32      0.47      1328
                 offer       0.00      0.00      0.00        39
           aid_related       0.78      0.51      0.62      3261
          medical_help       0.67      0.01      0.01       629
      medical_products       1.00      0.01      0.02       389
     search_and_rescue       0.00      0.00      0.00       196
              security       0.00      0.00      0.00       147
              military       0.00      0.00      0.00       277
           child_alone       0.00      0.00      0.00         0
                 water       0.92      0.14      0.24       499
                  food       0.84      0.27      0.41       892
               shelter       0.87      0.19      0.31       650
              clothing       0.00      0.00      0.00       127
                 money       1.00      



In [24]:
y_test.reset_index(drop=True)

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
1,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1
2,1,1,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0
4,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7848,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7849,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,1,1,1,0,0,0,0,0,0
7850,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1
7851,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [25]:
y_pred

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 1, 0, ..., 0, 0, 1]])

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
# parameters for GridSearch
parameters = {
    'tfidf_vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf_vect__max_df': (0.5, 0.75, 1.0),
    'tfidf_vect__max_features': (None, 5000, 10000),
    #'tfidf_vect__tfidf__use_idf': (True, False),
    #'tfidf_vect__stop_words': (None, stopwords.words('english'))
}
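
Each key in the grid follows the `<step name>__<parameter>` convention, so `tfidf_vect__max_df` sets `max_df` on the pipeline step named `tfidf_vect`. Before starting a search it can help to count how many fits it will trigger. A sketch using the active entries of the grid above and assuming the default 5-fold cross-validation of `GridSearchCV`:

```python
from sklearn.model_selection import ParameterGrid

# Active (uncommented) entries of the grid defined above
parameters = {
    'tfidf_vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf_vect__max_df': (0.5, 0.75, 1.0),
    'tfidf_vect__max_features': (None, 5000, 10000),
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(parameters))

# Assumption: cv is left at its default of 5 folds
cv_folds = 5
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)
```

With the default `refit=True`, one more fit on the full training set follows once the best combination is known.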

In [None]:
def display_results(cv, y_test, y_pred):
    # confusion_matrix does not accept multilabel targets, so report per-category metrics instead
    print(metrics.classification_report(y_test, y_pred, target_names=y_test.columns.values))

    # share of individual labels predicted correctly, averaged over all categories
    accuracy = (y_pred == y_test.values).mean()

    print("Accuracy:", accuracy)
    print("\nBest Parameters:", cv.best_params_)

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
cv = GridSearchCV(pipeline, param_grid = parameters)
model = cv
model.fit(X_train, y_train)
y_pred = model.predict(X_test)




In [29]:
# pickle is part of the Python standard library, so this install fails (see below) and is not needed
!pip install pickle --upgrade

Collecting pickle
  Could not find a version that satisfies the requirement pickle (from versions: )
No matching distribution found for pickle


In [30]:
import pickle

In [31]:
s = pickle.dumps(model)


In [32]:
dump(model, 'disaster_response_model.joblib') 

['disaster_response_model.joblib']

In [None]:
model = load('disaster_response_model.joblib') 

In [34]:
display_results(model, y_test, y_pred)

                        precision    recall  f1-score   support

               related       0.83      0.93      0.88      6015
               request       0.85      0.44      0.58      1322
                 offer       0.00      0.00      0.00        38
           aid_related       0.76      0.56      0.64      3275
          medical_help       0.56      0.13      0.21       611
      medical_products       0.69      0.12      0.21       390
     search_and_rescue       0.62      0.10      0.17       220
              security       0.33      0.02      0.04       139
              military       0.58      0.12      0.20       255
           child_alone       0.00      0.00      0.00         0
                 water       0.79      0.47      0.59       485
                  food       0.84      0.58      0.69       905
               shelter       0.81      0.37      0.51       706
              clothing       0.74      0.28      0.41       124
                 money       1.00      



### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [None]:
display_results(model, y_test, y_pred)

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
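
One way to try another algorithm without rebuilding the pipeline is to swap the estimator wrapped by `MultiOutputClassifier`. The sketch below is an illustration only: the texts and labels are made up, and `LogisticRegression` is just one possible alternative, not a recommendation from the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical messages with two label columns: water, food
demo_texts = [
    'we need water', 'send food please', 'water and food needed',
    'the road is blocked', 'no water here', 'food is gone',
]
demo_labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1]])

# Same step names as the simple pipeline built earlier
demo_pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

# Replace only the wrapped estimator; the TF-IDF step stays as it is
demo_pipeline.set_params(clf__estimator=LogisticRegression(max_iter=1000))
demo_pipeline.fit(demo_texts, demo_labels)

demo_pred = demo_pipeline.predict(['please send water'])
print(demo_pred.shape)
```

The same `clf__estimator` key can also appear in a grid-search parameter grid, which lets the search compare algorithms directly.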

### 9. Export your model as a pickle file
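
A minimal sketch of writing a fitted model to a pickle file and reading it back. The `DummyClassifier` is a tiny stand-in for the tuned model, and the file name `classifier_demo.pkl` is a made-up example written to a temporary directory.

```python
import os
import pickle
import tempfile

from sklearn.dummy import DummyClassifier

# Tiny stand-in for the tuned model
demo_model = DummyClassifier(strategy='most_frequent').fit(
    [[0], [1], [2]], [1, 1, 0]
)

# Hypothetical file name inside a temporary directory
model_path = os.path.join(tempfile.mkdtemp(), 'classifier_demo.pkl')

# Write the model in binary mode
with open(model_path, 'wb') as f:
    pickle.dump(demo_model, f)

# Read it back and check that it predicts the same label
with open(model_path, 'rb') as f:
    restored = pickle.load(f)

same_prediction = restored.predict([[5]])[0] == demo_model.predict([[5]])[0]
print(os.path.basename(model_path), same_prediction)
```

A pickled scikit-learn model should be loaded with the same library version that wrote it, since compatibility across versions is not guaranteed.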

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
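
The template itself is not shown here, so the following is only a sketch of how such a script might take its inputs. The argument names `database_filepath` and `model_filepath`, and the example file names, are assumptions for illustration.

```python
import argparse


def parse_args(argv=None):
    # Two positional arguments: where to read the data, where to write the model
    parser = argparse.ArgumentParser(
        description='Train the classifier and export it.'
    )
    parser.add_argument('database_filepath',
                        help='SQLite database produced by the ETL step')
    parser.add_argument('model_filepath',
                        help='where to write the pickled model')
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would go here in order:
    # load data, build pipeline, fit, evaluate, export
    print(f'Would load {args.database_filepath} '
          f'and save the model to {args.model_filepath}')
    return args


# Example call with made-up file names
args = main(['example.db', 'model.pkl'])
```

In a real script the last line would typically be replaced by a call to `main()` under an `if __name__ == '__main__':` guard so that the arguments come from the command line.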