# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [86]:
# import libraries
import numpy as np
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Create an SQLite engine
engine = create_engine('sqlite:///samtable.db')

# Load dataset from database with read_sql_table
df = pd.read_sql_table('samtable', engine)
df.head()

# Define feature and target variables X and Y
X = df['message']
Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)

# check the value of X and Y

print("total row in Y:", Y.shape[0])
print("total column in Y:", Y.shape[1])
print("total missing value in Y", Y.isnull().sum().sum())  # need to be delete?

print("total row in X:", X.shape[0])
print("unique value X:", X.nunique())
print("missing value in X:", X.isnull().sum().sum())  








total row in Y: 26386
total column in Y: 35
total missing value in Y 4830
total row in X: 26386
unique value X: 26177
missing value in X: 0


In [87]:
# remove missin value from Y
Y.dropna(inplace=True)

In [96]:
print("total row in Y:", Y.shape[0])
print("total column in Y:", Y.shape[1])
print("total missing value in Y", Y.isnull().sum().sum())  # need to be delete?

print("total row in X:", X.shape[0])
print("unique value X:", X.nunique())
print("missing value in X:", X.isnull().sum().sum())  


total row in Y: 26248
total column in Y: 35
total missing value in Y 0
total row in X: 26386
unique value X: 26177
missing value in X: 0


In [106]:
## check the difference btw X and Y
diff_rows = X.shape[0] - Y.shape[0]

print(diff_rows)

# if x have more row 
if diff_rows > 0:
    X = X.sample(n=Y.shape[0], random_state=42)  



138


In [107]:
print("total row in Y:", Y.shape[0])
print("total column in Y:", Y.shape[1])
print("total missing value in Y", Y.isnull().sum().sum())  # need to be delete?

print("total row in X:", X.shape[0])
print("unique value X:", X.nunique())
print("missing value in X:", X.isnull().sum().sum())  

total row in Y: 26248
total column in Y: 35
total missing value in Y 0
total row in X: 26248
unique value X: 26041
missing value in X: 0


### 2. Write a tokenization function to process your text data

In [108]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

def tokenize(text):
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    stop_words = stopwords.words("english")
    
    clean_tokens = []
    for token in tokens:
        if token not in stop_words:
            if re.match('^[a-zA-Z0-9_-]*$', token):  # Controlla se il token è composto solo da lettere, numeri, trattini e underscore
                clean_tokens.append(lemmatizer.lemmatize(token).lower().strip())

    return clean_tokens



[nltk_data] Downloading package punkt to
[nltk_data]     /Users/samuelmilazzo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/samuelmilazzo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [113]:
# Split the dataset into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

# Convert object columns to int in Y_train
for col in Y_train.columns:
    Y_train[col] = Y_train[col].apply(lambda x: int(x.split('-')[1]) if x is not None else 0).astype(int)

# Convert object columns to int in Y_test
for col in Y_test.columns:
    Y_test[col] = Y_test[col].apply(lambda x: int(x.split('-')[1]) if x is not None else 0).astype(int)
    
# Check data types of target variables
print("Data type of Y_train:", Y_train.dtypes)
print("Data type of Y_test:", Y_test.dtypes)

# Check unique values in target variables
print("Unique values in Y_train:", Y_train.nunique())
print("Unique values in Y_test:", Y_test.nunique())




Data type of Y_train: request                   int64
offer                     int64
aid_related               int64
medical_help              int64
medical_products          int64
search_and_rescue         int64
security                  int64
military                  int64
child_alone               int64
water                     int64
food                      int64
shelter                   int64
clothing                  int64
money                     int64
missing_people            int64
refugees                  int64
death                     int64
other_aid                 int64
infrastructure_related    int64
transport                 int64
buildings                 int64
electricity               int64
tools                     int64
hospitals                 int64
shops                     int64
aid_centers               int64
other_infrastructure      int64
weather_related           int64
floods                    int64
storm                     int64
fire              

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [112]:
from sklearn.model_selection import train_test_split

# Split data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Train pipeline
pipeline.fit(X_train, Y_train)





### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [114]:
# Split the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Fit the pipeline on the training data
pipeline.fit(X_train, Y_train)

# Make predictions on the testing data
Y_pred = pipeline.predict(X_test)

# Report the f1 score, precision, and recall for each output category
for i, category in enumerate(Y.columns):
    print(f"Category: {category}")
    print(classification_report(Y_test[category], Y_pred[:, i]))




Category: request
              precision    recall  f1-score   support

   request-0       0.83      0.97      0.90      4355
   request-1       0.13      0.02      0.03       895

    accuracy                           0.81      5250
   macro avg       0.48      0.50      0.46      5250
weighted avg       0.71      0.81      0.75      5250

Category: offer
              precision    recall  f1-score   support

     offer-0       0.99      1.00      1.00      5222
     offer-1       0.00      0.00      0.00        28

    accuracy                           0.99      5250
   macro avg       0.50      0.50      0.50      5250
weighted avg       0.99      0.99      0.99      5250

Category: aid_related


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


               precision    recall  f1-score   support

aid_related-0       0.59      0.81      0.68      3106
aid_related-1       0.38      0.17      0.24      2144

     accuracy                           0.55      5250
    macro avg       0.49      0.49      0.46      5250
 weighted avg       0.50      0.55      0.50      5250

Category: medical_help
                precision    recall  f1-score   support

medical_help-0       0.92      0.99      0.96      4832
medical_help-1       0.05      0.00      0.01       418

      accuracy                           0.91      5250
     macro avg       0.49      0.50      0.48      5250
  weighted avg       0.85      0.91      0.88      5250

Category: medical_products
                    precision    recall  f1-score   support

medical_products-0       0.95      1.00      0.97      4996
medical_products-1       0.08      0.00      0.01       254

          accuracy                           0.95      5250
         macro avg       0.51      0

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


              precision    recall  f1-score   support

     money-0       0.98      1.00      0.99      5123
     money-1       0.00      0.00      0.00       127

    accuracy                           0.98      5250
   macro avg       0.49      0.50      0.49      5250
weighted avg       0.95      0.98      0.96      5250

Category: missing_people
                  precision    recall  f1-score   support

missing_people-0       0.99      1.00      0.99      5193
missing_people-1       0.00      0.00      0.00        57

        accuracy                           0.99      5250
       macro avg       0.49      0.50      0.50      5250
    weighted avg       0.98      0.99      0.98      5250

Category: refugees
              precision    recall  f1-score   support

  refugees-0       0.97      1.00      0.98      5070
  refugees-1       0.00      0.00      0.00       180

    accuracy                           0.96      5250
   macro avg       0.48      0.50      0.49      5250
weight

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


              precision    recall  f1-score   support

     tools-0       0.99      1.00      1.00      5217
     tools-1       0.00      0.00      0.00        33

    accuracy                           0.99      5250
   macro avg       0.50      0.50      0.50      5250
weighted avg       0.99      0.99      0.99      5250

Category: hospitals
              precision    recall  f1-score   support

 hospitals-0       0.99      1.00      1.00      5206
 hospitals-1       0.00      0.00      0.00        44

    accuracy                           0.99      5250
   macro avg       0.50      0.50      0.50      5250
weighted avg       0.98      0.99      0.99      5250

Category: shops
              precision    recall  f1-score   support

     shops-0       1.00      1.00      1.00      5229
     shops-1       0.00      0.00      0.00        21

    accuracy                           1.00      5250
   macro avg       0.50      0.50      0.50      5250
weighted avg       0.99      1.00     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


               precision    recall  f1-score   support

aid_centers-0       0.99      1.00      1.00      5203
aid_centers-1       0.00      0.00      0.00        47

     accuracy                           0.99      5250
    macro avg       0.50      0.50      0.50      5250
 weighted avg       0.98      0.99      0.99      5250

Category: other_infrastructure
                        precision    recall  f1-score   support

other_infrastructure-0       0.96      1.00      0.98      5022
other_infrastructure-1       0.00      0.00      0.00       228

              accuracy                           0.96      5250
             macro avg       0.48      0.50      0.49      5250
          weighted avg       0.92      0.96      0.94      5250

Category: weather_related
                   precision    recall  f1-score   support

weather_related-0       0.73      0.95      0.83      3815
weather_related-1       0.26      0.04      0.07      1435

         accuracy                           

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


              precision    recall  f1-score   support

earthquake-0       0.90      0.99      0.94      4738
earthquake-1       0.10      0.01      0.02       512

    accuracy                           0.90      5250
   macro avg       0.50      0.50      0.48      5250
weighted avg       0.82      0.90      0.85      5250

Category: cold
              precision    recall  f1-score   support

      cold-0       0.98      1.00      0.99      5155
      cold-1       0.00      0.00      0.00        95

    accuracy                           0.98      5250
   macro avg       0.49      0.50      0.50      5250
weighted avg       0.96      0.98      0.97      5250

Category: other_weather
                 precision    recall  f1-score   support

other_weather-0       0.96      1.00      0.98      5020
other_weather-1       0.00      0.00      0.00       230

       accuracy                           0.95      5250
      macro avg       0.48      0.50      0.49      5250
   weighted avg     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


                 precision    recall  f1-score   support

direct_report-0       0.81      0.97      0.88      4243
direct_report-1       0.21      0.03      0.05      1007

       accuracy                           0.79      5250
      macro avg       0.51      0.50      0.47      5250
   weighted avg       0.69      0.79      0.72      5250



### 6. Improve your model
Use grid search to find better parameters. 

In [116]:
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'tfidf__use_idf': (True, False),
    'clf__estimator__n_estimators': [10, 50],
    'clf__estimator__min_samples_split': [2, 5, 10]
}

cv = GridSearchCV(pipeline, param_grid=parameters, verbose=2, n_jobs=-1)
cv.fit(X_train, Y_train)


Fitting 5 folds for each of 24 candidates, totalling 120 fits


BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [12]:
Y_pred_tuned = cv.predict(X_test)

# Show the accuracy, precision, and recall of the tuned model
for i, category in enumerate(Y.columns):
    print(f"Category: {category}")
    print(classification_report(Y_test[category], Y_pred_tuned[:, i]))


NotFittedError: This GridSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

In [10]:
import joblib

joblib.dump(cv.best_estimator_, 'disaster_response_model.pkl')


NameError: name 'cv' is not defined

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.