# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [50]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
import re
import numpy as np


from sklearn import multioutput

from sklearn.metrics import fbeta_score, make_scorer

import matplotlib.pyplot as plt
%matplotlib inline

In [51]:
import nltk
import os

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.data.path.append(os.getcwd())
nltk.download('averaged_perceptron_tagger', download_dir=os.getcwd())

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /workspace/home...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [52]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('MessagesCategories', engine)

df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [53]:
y_cols = []

for i in df.columns:
    if i not in ['id', 'message', 'original', 'genre']:
        y_cols.append(i)
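
The loop above keeps every column that is not an identifier or a text field. As a minimal sketch, the same selection can be written as a list comprehension. The column list below is a small hypothetical subset standing in for `df.columns`, and `non_target_cols` is an illustrative name, not part of the notebook.

```python
# Hypothetical subset of the columns in the MessagesCategories table
columns = ['id', 'message', 'original', 'genre', 'related', 'request', 'offer']

# Columns that describe the message itself rather than a category label
non_target_cols = {'id', 'message', 'original', 'genre'}

# Keep the remaining columns, preserving their original order
y_cols = [col for col in columns if col not in non_target_cols]

print(y_cols)
```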

In [54]:

X = df['message']
y = df[y_cols]
display(y.head(n=10))

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


### 2. Write a tokenization function to process your text data

In [55]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

def tokenize(text):
    '''
        Input: raw message text
        Process: replace URLs with a placeholder, tokenize into words,
                 then lemmatize, lowercase and strip each token
        Output: return a list of cleaned, lemmatized tokens.
    '''
    
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
        
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
        
#     # Normalize text
#     text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
#     stop_words = stopwords.words("english")
    
#     #tokenize
#     tokens = word_tokenize (text)
    
#     #lemmatize
#     lemmatizer = WordNetLemmatizer()
    
#     #porter stemmer
#     por_stem = PorterStemmer()
    
#     #stemming
#     stemmed = [por_stem.stem(w) for w in tokens]
    
#     #lemmatizing
#     words_lemmed = [lemmatizer.lemmatize(w, pos='v') for w in stemmed if w not in stop_words]
   
#     return words_lemmed
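
The URL handling in `tokenize` can be tried in isolation. The sketch below reuses the same pattern with only the standard library; `replace_urls` and `simple_tokenize` are hypothetical helpers, and the whitespace split is a simplified stand-in for `word_tokenize` plus lemmatization, not the notebook's actual tokenizer.

```python
import re

# Same pattern as in tokenize(), written as a raw string
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by URL_REGEX with a fixed placeholder."""
    for url in re.findall(URL_REGEX, text):
        text = text.replace(url, placeholder)
    return text

def simple_tokenize(text):
    """Stand-in tokenizer: split on whitespace, then lowercase and strip."""
    return [tok.lower().strip() for tok in replace_urls(text).split()]

tokens = simple_tokenize("Need water see http://example.com/help NOW")
print(tokens)
```

Replacing URLs before vectorizing means that all links collapse into one vocabulary entry instead of each URL becoming its own rare feature.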



In [56]:
from sklearn.base import BaseEstimator, TransformerMixin

class StartingVerbExtractor(BaseEstimator, TransformerMixin):

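    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # Skip sentences that yield no tokens to avoid an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            # tokenize() lowercases every token, so compare against 'rt'
            if first_tag in ['VB', 'VBP'] or first_word == 'rt':
                return True
        return False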

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)

In [57]:
from sklearn.metrics import f1_score

def f1_score_mean(y_test, y_pred):
    y_test_np = np.array(y_test)
    y_pred = np.array(y_pred)

    # Initialize list to store F1-scores for each output
    f1_scores_per_output = []

    # Iterate through each column (output)
    for i in range(y_test_np.shape[1]):
        # Compute F1-score for each output column
        f1 = f1_score(y_test_np[:, i], y_pred[:, i], average='weighted')
        f1_scores_per_output.append(f1)

    return f1_scores_per_output
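
With `average='weighted'`, the F1 score of each class is weighted by that class's share of the true labels. The sketch below works this out by hand for a single binary label in plain Python. `f1_for_class` and `weighted_f1` are hypothetical helpers written for illustration, not scikit-learn functions, and the toy labels are made up.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 of one class, treating `cls` as the positive label (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    classes = sorted(set(y_true))
    total = len(y_true)
    return sum(
        f1_for_class(y_true, y_pred, cls) * y_true.count(cls) / total
        for cls in classes
    )

# Imbalanced toy label: 9 negatives, 1 positive; the model always predicts 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print(round(weighted_f1(y_true, y_pred), 3))
```

In this toy case the weighted F1 is about 0.85 even though the positive class is never detected, so a high weighted F1 on a sparse label should be read alongside the positive-class scores.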


In [86]:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Define a function to manually create a classification report
def ML_classification_report(y_test, y_pred):
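    '''
        INPUT: y_test (DataFrame of true labels), y_pred (array of predicted labels)
        
        PROCESS: Build a per-category report of weighted precision, recall and
                 f1-score, plus support (number of positive examples) and accuracy
        
        OUTPUT: DataFrame of precision, recall, f1-score, support, and accuracy
    
    '''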
    
    report = {}
    for i, column in enumerate(y_test.columns):
        # Calculate precision, recall, and f1-score for each column
        precision = precision_score(y_test[column], y_pred[:, i], average='weighted')
        recall = recall_score(y_test[column], y_pred[:, i], average='weighted')
        f1 = f1_score(y_test[column], y_pred[:, i], average='weighted')
        support = y_test[column].sum()
        accuracy = accuracy_score(y_test.iloc[:, i].values, y_pred[:, i])

        # Store the results in a dictionary
        # round() works for both NumPy scalars and built-in floats
        report[column] = {
            'precision': round(precision, 2),
            'recall': round(recall, 2),
            'f1-score': round(f1, 2),
            'support': support,
            'accuracy': round(accuracy, 2)
        }
    
    # Convert the dictionary to a DataFrame and transpose it
    return pd.DataFrame(report).transpose()



In [59]:
# from sklearn.metrics import classification_report, accuracy_score


# def display_classification_report(y_test, y_pred):
    
#     for i, column in enumerate(y_test.columns):
#         print(f'Classification report of {column}:\n')
#         print(classification_report(y_test[column], y_pred[:, i]))

#         print("Accuracy: {0:.2f} %\n".format(accuracy_score(y_test.iloc[:, i].values, y_pred[:, i])))
#         print('-' * 60)

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [60]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.pipeline import Pipeline, FeatureUnion



pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer = tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(KNeighborsClassifier()))])
    
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__algorithm', 'clf__estimator__leaf_size', 'clf__estimator__metric', 'clf__estimator__metric_params', 'clf__estimator__n_jobs', 'clf__estimator__n_neighbors', 'clf__estimator__p', 'clf__estimator__weights', 'clf__estimator', 'clf__n_jobs'])
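
The `tfidf` step rescales the raw counts produced by `vect`. As a rough sketch of what `TfidfTransformer` does with its default settings (`smooth_idf=True`, `norm='l2'`), the plain-Python function below applies the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 and then normalizes each row to unit length. `tfidf_rows` is a hypothetical helper and the count matrix is toy data.

```python
import math

def tfidf_rows(counts):
    """Apply smoothed IDF weighting and L2 row normalization to a count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents containing each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Toy counts: 3 documents x 2 terms; term 0 occurs everywhere, term 1 only once
counts = [[1, 0], [1, 0], [1, 1]]
rows = tfidf_rows(counts)
print([round(v, 3) for v in rows[2]])
```

In the third document the rare term ends up with a larger weight than the term that appears in every document, which is the effect the `tfidf` step is meant to have.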

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [61]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [62]:
%%time
# train the classifier
pipeline.fit(X_train, y_train)

CPU times: user 9.16 s, sys: 95.3 ms, total: 9.26 s
Wall time: 9.26 s


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [63]:
%%time
y_pred = pipeline.predict(X_test)

print(f'y_pred: {y_pred}')

y_pred: [[ 1.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 ..., 
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  0.  0.  0.]]
CPU times: user 3min 37s, sys: 1min 10s, total: 4min 47s
Wall time: 4min 48s


In [64]:
overall_accuracy = (y_pred == y_test).mean().mean()*100
print('Overall Accuracy: {0:.1f} %'.format(overall_accuracy))

Overall Accuracy: 92.1 %
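
The expression `(y_pred == y_test).mean().mean()` averages the per-category accuracies, which is the fraction of individual (message, category) cells predicted correctly. The sketch below contrasts that with exact-match accuracy on made-up data. Both helper functions are hypothetical and only illustrate how sparse labels can inflate the label-wise figure.

```python
def label_wise_accuracy(y_true, y_pred):
    """Fraction of individual (message, category) cells predicted correctly."""
    cells = [
        t == p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    ]
    return sum(cells) / len(cells)

def exact_match_accuracy(y_true, y_pred):
    """Fraction of messages for which every category is predicted correctly."""
    matches = [true_row == pred_row for true_row, pred_row in zip(y_true, y_pred)]
    return sum(matches) / len(matches)

# Toy data: 4 messages x 5 sparse categories
y_true = [
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]
# A trivial "model" that predicts 0 everywhere
y_pred = [[0] * 5 for _ in y_true]

print(label_wise_accuracy(y_true, y_pred))   # 16 of 20 cells correct
print(exact_match_accuracy(y_true, y_pred))  # 1 of 4 rows correct
```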


In [65]:
scores = f1_score_mean(y_test, y_pred)
print(f"f1_score mean :{np.mean(scores)}")

f1_score mean :0.899121448835163




In [87]:
ML_classification_report(y_test, y_pred)



Unnamed: 0,accuracy,f1-score,precision,recall,support
related,0.74,0.68,0.67,0.74,5131.0
request,0.81,0.77,0.76,0.81,1127.0
offer,0.99,0.99,0.99,0.99,35.0
aid_related,0.56,0.51,0.52,0.56,2750.0
medical_help,0.92,0.88,0.86,0.92,520.0
medical_products,0.95,0.92,0.9,0.95,332.0
search_and_rescue,0.97,0.96,0.94,0.97,197.0
security,0.98,0.97,0.96,0.98,118.0
military,0.96,0.95,0.93,0.96,231.0
child_alone,1.0,1.0,1.0,1.0,0.0


### 6. Improve your model
Use grid search to find better parameters. 

In [67]:
# Inspect the pipeline's parameters and their current values
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7f08e2f506a8>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
              metric_params=None, n_jobs=1, n_neighbors=5, p=2,
              weights='uniform'),
              n_jobs=1))],
 'vect': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_fea

In [68]:
from sklearn.model_selection import GridSearchCV

# parameters = {
#     'vect__ngram_range': ((1, 1),),
#     'vect__max_df': (0.5, 0.75), 
#     'clf__estimator__n_estimators': [50],
#     'clf__estimator__min_samples_split': [2, 3]
# }

parameters = {
    'vect__ngram_range': ((1, 1),),
    'clf__estimator__n_neighbors': [5, 10],
    'vect__max_df': (0.5, 0.75)
}

cv = GridSearchCV(pipeline, param_grid=parameters, cv=3, verbose=2, n_jobs=1)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [69]:
%%time
cv.fit(X_train, y_train)
# print('Best Parameters :', cv.best_params_)

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] clf__estimator__n_neighbors=5, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_neighbors=5, vect__max_df=0.5, vect__ngram_range=(1, 1), total= 2.8min
[CV] clf__estimator__n_neighbors=5, vect__max_df=0.5, vect__ngram_range=(1, 1) 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  8.3min remaining:    0.0s


[CV]  clf__estimator__n_neighbors=5, vect__max_df=0.5, vect__ngram_range=(1, 1), total= 2.8min
[CV] clf__estimator__n_neighbors=5, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_neighbors=5, vect__max_df=0.5, vect__ngram_range=(1, 1), total= 2.8min
[CV] clf__estimator__n_neighbors=5, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_neighbors=5, vect__max_df=0.75, vect__ngram_range=(1, 1), total= 3.0min
[CV] clf__estimator__n_neighbors=5, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_neighbors=5, vect__max_df=0.75, vect__ngram_range=(1, 1), total= 3.0min
[CV] clf__estimator__n_neighbors=5, vect__max_df=0.75, vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_neighbors=5, vect__max_df=0.75, vect__ngram_range=(1, 1), total= 3.0min
[CV] clf__estimator__n_neighbors=10, vect__max_df=0.5, vect__ngram_range=(1, 1) 
[CV]  clf__estimator__n_neighbors=10, vect__max_df=0.5, vect__ngram_range=(1, 1), total= 2.8min
[CV] clf__estimator__n

[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 102.0min finished


CPU times: user 1h 16min, sys: 26min 4s, total: 1h 42min 5s
Wall time: 1h 42min 10s


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...ric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'vect__ngram_range': ((1, 1),), 'clf__estimator__n_neighbors': [5, 10], 'vect__max_df': (0.5, 0.75)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)

In [70]:
print('Best Parameters:', cv.best_params_)


Best Parameters: {'clf__estimator__n_neighbors': 10, 'vect__max_df': 0.75, 'vect__ngram_range': (1, 1)}


This output shows a GridSearchCV run over the pipeline, which consists of a CountVectorizer (text vectorization), a TfidfTransformer and a KNeighborsClassifier wrapped in a MultiOutputClassifier. The grid search evaluates the following hyperparameters:

- vect__ngram_range: the range of n-grams used as text features, fixed at (1, 1), which means unigrams only.
- clf__estimator__n_neighbors: the number of neighbors for the K-Nearest Neighbors (KNN) classifier, tested with values [5, 10].
- vect__max_df: the maximum document frequency for terms; terms that appear in a larger fraction of the documents are ignored. The grid search tests values [0.5, 0.75].

The grid search used 3-fold cross-validation: the training data was split into three parts, and the model was trained and validated three times for each combination of parameters.

**Execution Details:**

- The grid contains 4 parameter combinations (1 x 2 x 2), leading to 12 fits (4 combinations × 3 folds).
- The fits shown in the log above each took about 2.8 to 3.0 minutes.
- The total execution time was approximately 1 hour and 42 minutes.

**Results:**

- The verbose log confirms that all 12 fits completed, but it does not report any scores.
- The selected parameters come from cv.best_params_, printed above: n_neighbors=10, max_df=0.75 and ngram_range=(1, 1). The corresponding mean cross-validation score is available as cv.best_score_.
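
The count of 4 candidates and 12 fits follows directly from the parameter grid. As a small sketch, the combinations can be enumerated with `itertools.product`; the dictionary is copied from the notebook, while `n_folds`, `candidates` and `n_fits` are illustrative names.

```python
from itertools import product

# Same grid as the `parameters` dict passed to GridSearchCV
parameters = {
    'vect__ngram_range': ((1, 1),),
    'clf__estimator__n_neighbors': [5, 10],
    'vect__max_df': (0.5, 0.75),
}
n_folds = 3

# Every combination of one value per parameter is one candidate
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
for candidate in candidates:
    print(candidate)
```

Adding one more value to any parameter multiplies the number of fits, which matters here because each fit takes several minutes. With `refit=True`, GridSearchCV also performs one final fit on the full training set that is not part of the 12.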

In [71]:
y_pred_tuned = cv.predict(X_test)

#turning to a dataframe
y_pred_df = pd.DataFrame(y_pred_tuned, columns = y_test.columns)

In [72]:
y_pred_tuned

array([[ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.]])

In [75]:
new_overall_accuracy = (y_pred_tuned == y_test).mean().mean()*100
print('Overall Accuracy: {0:.1f} %'.format(new_overall_accuracy))

Overall Accuracy: 92.4 %


In [76]:
new_f1_scores = f1_score_mean(y_test, y_pred_tuned)
print(f"f1_score mean :{np.mean(new_f1_scores)}")

f1_score mean :0.8995074888578789




In [88]:
ML_classification_report(y_test, y_pred_tuned)



Unnamed: 0,accuracy,f1-score,precision,recall,support
related,0.75,0.68,0.66,0.75,5131.0
request,0.82,0.77,0.76,0.82,1127.0
offer,0.99,0.99,0.99,0.99,35.0
aid_related,0.56,0.51,0.52,0.56,2750.0
medical_help,0.92,0.88,0.85,0.92,520.0
medical_products,0.95,0.92,0.9,0.95,332.0
search_and_rescue,0.97,0.96,0.94,0.97,197.0
security,0.98,0.97,0.96,0.98,118.0
military,0.96,0.95,0.93,0.96,231.0
child_alone,1.0,1.0,1.0,1.0,0.0


## Key Points:
**GridSearchCV:** The grid search tuned a pipeline combining a CountVectorizer, a TfidfTransformer and a K-Nearest Neighbors (KNN) classifier over ngram_range, n_neighbors and max_df. It used 3-fold cross-validation, and the best model was refit and used for prediction.

**Prediction and Performance:**

- The tuned model predicted the test set (X_test), and the predictions were stored in a DataFrame (y_pred_df).
- The overall accuracy of the predictions was 92.4%, up from 92.1% before tuning.
- The mean weighted F1 score across all labels was 0.8995, essentially unchanged from 0.8991 before tuning.

**Classification Report:** Per-category accuracy, F1-score, precision and recall vary widely. Among the categories shown, 'aid_related' (F1 0.51) and 'related' (F1 0.68) score lowest. 'child_alone' has a support of 0, meaning the test set contains no positive examples, so its perfect scores only reflect predicting the single class present and say nothing about the model's ability to detect that category.

**Warnings:** Several warnings were raised because precision and F-score are undefined for labels with no predicted samples. This points to class imbalance: for some labels the model never predicts one of the classes.
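
The undefined-metric situation can be reproduced by hand: precision is tp / (tp + fp), so its denominator is zero whenever a class is never predicted. The sketch below uses a hypothetical helper, `positive_class_report`, that returns `None` where a score is undefined. By default scikit-learn reports 0.0 in that case and emits a warning instead.

```python
def positive_class_report(y_true, y_pred):
    """Precision and recall for the positive class (1); None where undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    predicted_pos = sum(1 for p in y_pred if p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    precision = tp / predicted_pos if predicted_pos else None
    recall = tp / actual_pos if actual_pos else None
    return {'precision': precision, 'recall': recall, 'support': actual_pos}

# Label with positives that the model never predicts: precision is undefined
never_predicted = positive_class_report([1, 0, 0, 1], [0, 0, 0, 0])

# Label with no positives at all (like child_alone): recall is undefined too
no_support = positive_class_report([0, 0, 0, 0], [0, 0, 0, 0])

print(never_predicted)
print(no_support)
```

Reporting positive-class scores next to the weighted averages makes it easier to spot labels where the model only ever predicts the majority class.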

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [78]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

def build_feature_model():
    pipeline = Pipeline([
        ('feature', FeatureUnion([
            ('test_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf',TfidfTransformer())
            ])),
            ('starting_verb', StartingVerbExtractor())
        ])),
        ('clf', MultiOutputClassifier(AdaBoostClassifier()))
    ])
    return pipeline

build_feature_model().get_params().keys()

dict_keys(['memory', 'steps', 'feature', 'clf', 'feature__n_jobs', 'feature__transformer_list', 'feature__transformer_weights', 'feature__test_pipeline', 'feature__starting_verb', 'feature__test_pipeline__memory', 'feature__test_pipeline__steps', 'feature__test_pipeline__vect', 'feature__test_pipeline__tfidf', 'feature__test_pipeline__vect__analyzer', 'feature__test_pipeline__vect__binary', 'feature__test_pipeline__vect__decode_error', 'feature__test_pipeline__vect__dtype', 'feature__test_pipeline__vect__encoding', 'feature__test_pipeline__vect__input', 'feature__test_pipeline__vect__lowercase', 'feature__test_pipeline__vect__max_df', 'feature__test_pipeline__vect__max_features', 'feature__test_pipeline__vect__min_df', 'feature__test_pipeline__vect__ngram_range', 'feature__test_pipeline__vect__preprocessor', 'feature__test_pipeline__vect__stop_words', 'feature__test_pipeline__vect__strip_accents', 'feature__test_pipeline__vect__token_pattern', 'feature__test_pipeline__vect__tokenizer',

In [79]:
%%time
feature_model = build_feature_model()

feature_model.fit(X_train, y_train)

CPU times: user 3min 7s, sys: 1.28 s, total: 3min 9s
Wall time: 3min 9s


In [80]:
%%time

y_pred_featured = feature_model.predict(X_test)

CPU times: user 18 s, sys: 410 ms, total: 18.4 s
Wall time: 18.5 s


In [81]:
new_overall_accuracy = (y_pred_featured == y_test).mean().mean()*100
print('Overall Accuracy: {0:.1f} %'.format(new_overall_accuracy))

new_f1_scores = f1_score_mean(y_test, y_pred_featured)
print(f"f1_score mean :{np.mean(new_f1_scores)}")

Overall Accuracy: 92.5 %
f1_score mean :0.8993194565291225


In [89]:
ML_classification_report(y_test, y_pred_featured)

Unnamed: 0,accuracy,f1-score,precision,recall,support
related,0.76,0.67,0.67,0.76,5131.0
request,0.82,0.78,0.77,0.82,1127.0
offer,0.99,0.99,0.99,0.99,35.0
aid_related,0.57,0.51,0.54,0.57,2750.0
medical_help,0.92,0.88,0.93,0.92,520.0
medical_products,0.95,0.92,0.9,0.95,332.0
search_and_rescue,0.97,0.95,0.94,0.97,197.0
security,0.98,0.97,0.96,0.98,118.0
military,0.96,0.95,0.93,0.96,231.0
child_alone,1.0,1.0,1.0,1.0,0.0


## Summary of Model Improvement

Objective: Improve the multi-label classification model by trying a different machine learning algorithm and adding a new feature.

#### Approach:

- Added a feature beyond TF-IDF: a StartingVerbExtractor that flags messages in which a sentence starts with a verb.
- Used AdaBoostClassifier within a MultiOutputClassifier to handle multi-label classification.
- Combined both feature extractors in a single pipeline with a FeatureUnion.

#### Results:

- Overall Accuracy: 92.5% (tuned KNN model: 92.4%)
- Average F1-Score: 0.899, on par with the tuned KNN model (0.8995)
- Among the categories shown, the model scores high on rare categories such as offer and security, and lowest on aid_related (F1 0.51) and related (F1 0.67), which are two of the most frequent categories.

#### Next Steps:

- Experiment with other classifiers (e.g., RandomForestClassifier).
- Further refine feature engineering, possibly incorporating word embeddings or additional syntactic features.
- Explore hyperparameter tuning and techniques for handling imbalanced data.

In [92]:
report_1 = ML_classification_report(y_test, y_pred_tuned)
report_2 = ML_classification_report(y_test, y_pred_featured)

comparison_df = pd.concat([report_1.add_suffix('_tuned'), report_2.add_suffix('_featured')], axis=1)

comparison_df



Unnamed: 0,accuracy_tuned,f1-score_tuned,precision_tuned,recall_tuned,support_tuned,accuracy_featured,f1-score_featured,precision_featured,recall_featured,support_featured
related,0.75,0.68,0.66,0.75,5131.0,0.76,0.67,0.67,0.76,5131.0
request,0.82,0.77,0.76,0.82,1127.0,0.82,0.78,0.77,0.82,1127.0
offer,0.99,0.99,0.99,0.99,35.0,0.99,0.99,0.99,0.99,35.0
aid_related,0.56,0.51,0.52,0.56,2750.0,0.57,0.51,0.54,0.57,2750.0
medical_help,0.92,0.88,0.85,0.92,520.0,0.92,0.88,0.93,0.92,520.0
medical_products,0.95,0.92,0.9,0.95,332.0,0.95,0.92,0.9,0.95,332.0
search_and_rescue,0.97,0.96,0.94,0.97,197.0,0.97,0.95,0.94,0.97,197.0
security,0.98,0.97,0.96,0.98,118.0,0.98,0.97,0.96,0.98,118.0
military,0.96,0.95,0.93,0.96,231.0,0.96,0.95,0.93,0.96,231.0
child_alone,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0


### 9. Export your model as a pickle file

In [93]:
import pickle

with open('classifier.pkl', 'wb') as f:
    pickle.dump(feature_model, f)
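
One thing to keep in mind when loading `classifier.pkl` elsewhere: pickle stores functions and classes by reference (module and name), not by copying their code. The pipeline above references `tokenize` and `StartingVerbExtractor`, so both need to be importable in the script that unpickles the model. The sketch below illustrates the by-reference behavior with a stand-in dictionary. `shlex.split` is used only because it is an importable standard-library function, not because the notebook uses it.

```python
import pickle
import shlex

# Stand-in for the fitted pipeline: a dict that references a tokenizer function.
# shlex.split is an assumption chosen for illustration only.
model = {'tokenizer': shlex.split, 'n_neighbors': 10}

# Serialize to bytes and restore (pickle.dump/load do the same with a file)
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The function is restored by looking up its module and name
print(restored['tokenizer'] is shlex.split)
print(b'shlex' in payload)
print(restored['tokenizer']('need water now'))
```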

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.