# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [27]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [28]:
# import libraries
import pandas as pd
import numpy as np
import nltk
from sqlalchemy import create_engine

In [29]:
# Load the necessary resources from nltk
nltk.download('punkt', force=True)
nltk.download('stopwords')
nltk.download('wordnet') # download for lemmatization

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/nguyenviettruong/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nguyenviettruong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/nguyenviettruong/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [30]:
# load data from database
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('InsertTableName', engine)  

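# Features: the raw message text
X = df['message']
# Targets: the 36 category columns that follow id, message, original, and genre
y = df.iloc[:, 4:]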
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
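The database and table names in the cell above are placeholders. As a minimal, self-contained sketch of the same load step, the code below writes a tiny made-up table to a temporary SQLite file and reads it back with `read_sql_table`. The file name `toy_messages.db`, the table name `toy_messages`, and the two sample rows are assumptions for illustration only. Selecting the targets by dropping the four non-label columns by name is one alternative to the positional `iloc[:, 4:]` used above.

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical toy data with the same leading columns as the real table
toy_df = pd.DataFrame({
    "id": [2, 7],
    "message": ["Weather update", "Is the Hurricane over"],
    "original": ["Un front froid", "Cyclone nan fini"],
    "genre": ["direct", "direct"],
    "related": [1, 1],
    "storm": [0, 1],
})

# Write the toy table to a temporary SQLite file (assumed names)
db_path = os.path.join(tempfile.mkdtemp(), "toy_messages.db")
engine = create_engine(f"sqlite:///{db_path}")
toy_df.to_sql("toy_messages", engine, index=False)

# Read it back the same way the notebook does
loaded = pd.read_sql_table("toy_messages", engine)

# Select features and targets by column name instead of by position
X_toy = loaded["message"]
y_toy = loaded.drop(columns=["id", "message", "original", "genre"])
print(list(y_toy.columns))
```

Dropping columns by name keeps the target selection correct even if the column order in the table changes.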


### 2. Write a tokenization function to process your text data

In [31]:
import re 
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

def self_tokenize(text):
    """Normalize, tokenize, remove stopwords, and lemmatize one message."""
    # Lowercase once, then replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-z0-9]", " ", text.lower())
    # Split the text into word tokens
    words = word_tokenize(text)
    # Initialize the lemmatizer
    lemmatizer = WordNetLemmatizer()

    # Load English stopwords
    stop_words = set(stopwords.words('english'))

    # Drop stopwords and lemmatize each remaining word as a verb (pos='v')
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words if word not in stop_words]

    return lemmas

X = X.apply(self_tokenize)
X
y

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26211,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26212,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26213,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
26214,1,0,0,1,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
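The classification report in section 5 lists a class `2` for the first output column, which suggests that at least one target column is not strictly binary. A quick check like the sketch below can list such columns before training. The DataFrame `toy_y` is a hypothetical stand-in for `y`, with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for the target DataFrame `y`
toy_y = pd.DataFrame({
    "related": [1, 0, 2, 1],
    "request": [0, 1, 0, 0],
})

# List every column that contains a value other than 0 or 1
non_binary = [col for col in toy_y.columns if not set(toy_y[col].unique()) <= {0, 1}]
print(non_binary)
```

Whether to drop such rows or remap the extra value is a modeling decision that this notebook leaves open.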


### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [32]:
from sklearn.feature_extraction.text import CountVectorizer

# Join each token list back into a single string so CountVectorizer can consume it
X = X.apply(lambda x: ' '.join(x))

X

0      weather update cold front cuba could pass haiti
1                                            hurricane
2                                    look someone name
3    un report leogane 80 90 destroy hospital st cr...
4       say west side haiti rest country today tonight
Name: message, dtype: object

In [33]:
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Split data to train data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),  # Vectorize the text into token counts
    ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100)))  # One RandomForest per output column
])


In [34]:
print(type(X_train))
print(type(y_train))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [35]:
# Train model
pipeline.fit(X_train, y_train)

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [36]:
from sklearn.metrics import classification_report

# Predict on the test set
y_pred = pipeline.predict(X_test)

print("Shape of y_pred: ", y_pred.shape)
print("Shape of y_test: ", y_test.values.shape)

# Number of labels
num_labels = y_test.values.shape[1]  

for i in range(num_labels):
    print(f"Classification Report for label {i}:")
    report = classification_report(y_test.values[:, i], y_pred[:, i], output_dict=True)

    for label, metrics in report.items():
        if isinstance(metrics, dict):
            print(f"Label: {label}")
            print(f"Precision: {metrics['precision']:.4f}")
            print(f"Recall: {metrics['recall']:.4f}")
            print(f"F1-Score: {metrics['f1-score']:.4f}")
            print()


Shape of y_pred:  (5244, 36)
Shape of y_test:  (5244, 36)
Classification Report for label 0:
Label: 0
Precision: 0.6945
Recall: 0.4597
F1-Score: 0.5532

Label: 1
Precision: 0.8419
Recall: 0.9327
F1-Score: 0.8850

Label: 2
Precision: 0.5116
Recall: 0.5500
F1-Score: 0.5301

Label: macro avg
Precision: 0.6827
Recall: 0.6475
F1-Score: 0.6561

Label: weighted avg
Precision: 0.8038
Recall: 0.8156
F1-Score: 0.8022

Classification Report for label 1:
Label: 0
Precision: 0.9061
Recall: 0.9788
F1-Score: 0.9411

Label: 1
Precision: 0.8315
Recall: 0.5073
F1-Score: 0.6301

Label: macro avg
Precision: 0.8688
Recall: 0.7431
F1-Score: 0.7856

Label: weighted avg
Precision: 0.8934
Recall: 0.8984
F1-Score: 0.8880

Classification Report for label 2:
Label: 0
Precision: 0.9950
Recall: 1.0000
F1-Score: 0.9975

Label: 1
Precision: 0.0000
Recall: 0.0000
F1-Score: 0.0000

Label: macro avg
Precision: 0.4975
Recall: 0.5000
F1-Score: 0.4988

Label: weighted avg
Precision: 0.9901
Recall: 0.9950
F1-Score: 0.9926



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
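The `_warn_prf` lines are most likely scikit-learn's undefined-metric warning, which is raised when a class receives no predicted samples. That matches the 0.0000 precision for class `1` in the third column. As a sketch under that assumption, passing `zero_division=0` to `classification_report` keeps the same numbers while silencing the warning, and iterating over column names makes each report easier to read than a numeric index. The arrays `toy_y_test` and `toy_y_pred` are made-up stand-ins for `y_test` and `y_pred`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical stand-ins for y_test (DataFrame) and y_pred (ndarray)
toy_y_test = pd.DataFrame({"related": [1, 0, 1, 1], "offer": [0, 0, 0, 1]})
toy_y_pred = np.array([[1, 0], [0, 0], [0, 0], [1, 0]])

reports = {}
for i, column in enumerate(toy_y_test.columns):
    # zero_division=0 reports 0.0 instead of warning when a class is never predicted
    reports[column] = classification_report(
        toy_y_test[column], toy_y_pred[:, i], output_dict=True, zero_division=0
    )
    macro = reports[column]["macro avg"]
    print(f"{column}: macro precision={macro['precision']:.4f}, "
          f"recall={macro['recall']:.4f}, f1={macro['f1-score']:.4f}")
```

Here the `offer` column has a positive example that is never predicted, which is the situation that would otherwise trigger the warning.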


### 6. Improve your model
Use grid search to find better parameters. 

In [42]:
from sklearn.model_selection import GridSearchCV
# from sklearn.metrics import make_scorer, f1_score

# Define the parameter grid
parameters = {
    'vectorizer__max_features': [1000, 2000],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'clf__estimator__max_depth': [10, 20],
    'clf__estimator__min_samples_split': [2, 3]
}

# Create the GridSearchCV object (default scoring and default cross-validation)
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit the grid search to the data
cv.fit(X_train, y_train)

# Best parameters found
print("Best parameters found: ", cv.best_params_)


Best parameters found:  {'clf__estimator__max_depth': 20, 'clf__estimator__min_samples_split': 2, 'vectorizer__max_features': 1000, 'vectorizer__ngram_range': (1, 2)}
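With no `scoring` argument, `GridSearchCV` ranks candidates by the pipeline's own `score` method, which for `MultiOutputClassifier` is accuracy over complete rows rather than a per-category F1. One possible alternative, sketched below, is a custom scorer that averages the macro F1 of every output column. The helper `mean_column_f1` and the name `multi_f1_scorer` are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer


def mean_column_f1(y_true, y_pred):
    """Average the macro F1 of every output column (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = [
        f1_score(y_true[:, i], y_pred[:, i], average="macro", zero_division=0)
        for i in range(y_true.shape[1])
    ]
    return float(np.mean(scores))


# Wrap the helper so GridSearchCV can call it on each validation fold
multi_f1_scorer = make_scorer(mean_column_f1)

# Small sanity check on made-up labels
toy_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
toy_all_zero = np.zeros_like(toy_true)
print(mean_column_f1(toy_true, toy_true))
print(mean_column_f1(toy_true, toy_all_zero))
```

The scorer could then be passed as `GridSearchCV(pipeline, param_grid=parameters, scoring=multi_f1_scorer)`. Whether it selects better parameters for this dataset is untested here.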


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [44]:
# Get the best parameters
print("Best parameters found: ", cv.best_params_)

# Predict on the test set using the best model
y_pred_tuned = cv.predict(X_test)

# Generate classification report
for i in range(num_labels):
    print(f"Classification Report for label {i}:")
    report = classification_report(y_test.values[:, i], y_pred_tuned[:, i], output_dict=True)

    for label, metrics in report.items():
        if isinstance(metrics, dict):
            print(f"Label: {label}")
            print(f"Precision: {metrics['precision']:.4f}")
            print(f"Recall: {metrics['recall']:.4f}")
            print(f"F1-Score: {metrics['f1-score']:.4f}")
            print()

Best parameters found:  {'clf__estimator__max_depth': 20, 'clf__estimator__min_samples_split': 2, 'vectorizer__max_features': 1000, 'vectorizer__ngram_range': (1, 2)}
Classification Report for label 0:
Label: 0
Precision: 0.7866
Recall: 0.1019
F1-Score: 0.1804

Label: 1
Precision: 0.7683
Recall: 0.9911
F1-Score: 0.8656

Label: 2
Precision: 0.0000
Recall: 0.0000
F1-Score: 0.0000

Label: macro avg
Precision: 0.5183
Recall: 0.3643
F1-Score: 0.3487

Label: weighted avg
Precision: 0.7669
Recall: 0.7689
F1-Score: 0.6936

Classification Report for label 1:
Label: 0
Precision: 0.8875
Recall: 0.9908
F1-Score: 0.9363

Label: 1
Precision: 0.8972
Recall: 0.3899
F1-Score: 0.5436

Label: macro avg
Precision: 0.8924
Recall: 0.6904
F1-Score: 0.7400

Label: weighted avg
Precision: 0.8892
Recall: 0.8883
F1-Score: 0.8693

Classification Report for label 2:
Label: 0
Precision: 0.9950
Recall: 1.0000
F1-Score: 0.9975

Label: 1
Precision: 0.0000
Recall: 0.0000
F1-Score: 0.0000

Label: macro avg
Precision: 0.

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
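As one way to act on both ideas, the sketch below swaps the random forest for logistic regression and combines TF-IDF with a second feature through `FeatureUnion`. The `MessageLength` transformer, the choice of `LogisticRegression`, and the toy data are all assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: the number of words in each message."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text.split())] for text in X])


# Made-up messages and two made-up binary target columns
toy_X = [
    "need water food",
    "storm come tonight",
    "need medical help",
    "heavy storm flood",
    "need shelter storm",
    "thank everyone",
]
toy_y = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 1], [0, 0]])

alt_pipeline = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer()),   # TF-IDF weighted token features
        ("length", MessageLength()),    # one extra numeric column
    ])),
    ("clf", MultiOutputClassifier(LogisticRegression(max_iter=1000))),
])

alt_pipeline.fit(toy_X, toy_y)
toy_pred = alt_pipeline.predict(["need water now", "storm warning"])
print(toy_pred.shape)
```

On the real data, `alt_pipeline` could be fit on `X_train` and `y_train` and evaluated with the same per-column report used in section 5.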

### 9. Export your model as a pickle file
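A minimal sketch of the export step, using Python's built-in `pickle` module. The tiny `toy_model` stands in for the fitted `pipeline` or `cv` object, and the file name `classifier.pkl` and the temporary directory are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny stand-in model trained on made-up data
toy_texts = ["need water", "storm come", "need food", "heavy storm"]
toy_labels = [0, 1, 0, 1]
toy_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", LogisticRegression()),
])
toy_model.fit(toy_texts, toy_labels)

# Save the fitted model to disk (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

# Load it back and confirm that it predicts the same labels
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = list(restored.predict(toy_texts)) == list(toy_model.predict(toy_texts))
print(same_predictions)
```

Because this notebook applies `self_tokenize` before the pipeline rather than inside it, new messages would need the same tokenize-and-join preprocessing before being passed to the loaded model. Only unpickle files from a source you trust.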

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.