# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
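
As a rough sketch of the loading step, the snippet below builds a tiny in-memory SQLite table and splits it into a message column and label columns. The table name `messages_demo`, its columns and rows, and the use of `pd.read_sql_query` over a plain `sqlite3` connection are assumptions made for illustration; the notebook itself reads the real table from `etl_pipeline.db` through SQLAlchemy.

```python
import sqlite3

import pandas as pd

# Hypothetical toy table standing in for the real ETL output.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages_demo (id INTEGER, message TEXT, related INTEGER, request INTEGER)"
)
conn.executemany(
    "INSERT INTO messages_demo VALUES (?, ?, ?, ?)",
    [
        (1, "We need water and tents", 1, 1),
        (2, "Weather update for tonight", 1, 0),
        (3, "Thank you", 0, 0),
    ],
)
conn.commit()

# pandas can read from a DB-API sqlite3 connection directly.
demo_df = pd.read_sql_query("SELECT * FROM messages_demo", conn)
conn.close()

# Features are the raw text; targets are the remaining label columns.
X_demo = demo_df["message"]
Y_demo = demo_df.drop(columns=["id", "message"])

print(X_demo.shape, Y_demo.shape)
```

Selecting the targets by dropping the non-label columns by name is one option; slicing by position with `iloc` works too but depends on the column order in the database.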

In [7]:
import pandas as pd
import numpy as np
import nltk
import pickle
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.base import BaseEstimator, TransformerMixin
from sqlalchemy import create_engine
from sklearn.datasets import make_multilabel_classification
# Download necessary NLTK data
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [8]:
# load data from database
def load_data():
    """Load the messages and their category labels from the SQLite database."""
    engine = create_engine('sqlite:///etl_pipeline.db')
    df = pd.read_sql('SELECT * FROM messageCat', engine)
    # Feature: raw message text; targets: the 36 category columns
    X = df['message']
    y = df.iloc[:, 4:40]
    return X, y


In [9]:
### 2. Write a tokenization function to process your text data
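def tokenize(texts):
    """Lowercase, strip punctuation, tokenize, and lemmatize one message."""
    # Replace punctuation (anything that is not a word character or whitespace) with spaces
    texts = re.sub(r'[^\w\s]', " ", texts.lower())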
    
    # Tokenize the text
    tokens = word_tokenize(texts)
    
    # Initialize the lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Lemmatize and clean each token
    clean_tokens = [lemmatizer.lemmatize(tok).strip() for tok in tokens]
    
    return clean_tokens

In [10]:

# Example usage
X, y= load_data()
for message in X[:10]:
    print(message)
    print(tokenize(message),'\n')
    

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'update', 'a', 'cold', 'front', 'from', 'cuba', 'that', 'could', 'pas', 'over', 'haiti'] 

Is the Hurricane over or is it not over
['is', 'the', 'hurricane', 'over', 'or', 'is', 'it', 'not', 'over'] 

Looking for someone but no name
['looking', 'for', 'someone', 'but', 'no', 'name'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', '80', '90', 'destroyed', 'only', 'hospital', 'st', 'croix', 'functioning', 'need', 'supply', 'desperately'] 

says: west side of Haiti, rest of the country today and tonight
['say', 'west', 'side', 'of', 'haiti', 'rest', 'of', 'the', 'country', 'today', 'and', 'tonight'] 

Information about the National Palace-
['information', 'about', 'the', 'national', 'palace'] 

Storm at sacred heart of jesus
['storm', 'at', 'sacred', 'heart', 'of', 'jesus'] 

Please, we need tents and water. We are in Silo, Thank

In [11]:
class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        """
        No fitting needed for this transformer.
        """
        return self

    def transform(self, X):
        """
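        Flags whether the first word of each text is title-cased.

        Despite the class name, no part-of-speech tagging is applied here;
        title case of the first word is the only signal used.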
        """
        def starting_verb(text):
            # Check for empty or whitespace-only strings
            if not text or not text.strip():
                return 0
            words = text.split()
            # Check if the first word is title case
            if len(words) > 0 and words[0].istitle():
                return 1
            return 0

        # Ensure input X is iterable
        if not isinstance(X, (list, np.ndarray)):
            X = X.tolist()

        # Apply the starting_verb function to each element in X
        features = [starting_verb(text) for text in X]

        # Return a 2D NumPy array (scikit-learn requires this format)
        return np.array(features).reshape(-1, 1)

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
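
To make the idea concrete, here is a minimal sketch of a text pipeline wrapped in `MultiOutputClassifier`, which fits one copy of the base estimator per target column. The toy messages, the two label columns, and the small forest size are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy messages with two binary labels: [water, shelter].
toy_messages = [
    "we need water",
    "please send water",
    "we need tents for shelter",
    "shelter and tents needed",
    "water and shelter needed",
    "thank you",
]
toy_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf_transformer", TfidfTransformer()),
    # One random forest is fitted per label column.
    ("classifier", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_messages, toy_labels)

# The prediction has one row per message and one column per label.
toy_pred = toy_pipeline.predict(["need water now", "tents please"])
print(toy_pred.shape)
```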

In [12]:

def build_model():
    # Define the pipeline
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('vectorizer', CountVectorizer(tokenizer=tokenize)),
                ('tfidf_transformer', TfidfTransformer())
            ])),
            ('starting_verb', StartingVerbExtractor())
        ])),
        ('classifier', MultiOutputClassifier(RandomForestClassifier()))
    ])

    # Specify parameters for grid search
    parameters = {
            'features__text_pipeline__vectorizer__ngram_range': [(1, 1)],
            'classifier__estimator__n_estimators': [50],
            'classifier__estimator__min_samples_split': [2]
        }
    
    # Create and return the grid search object
    gcv = GridSearchCV(
        estimator=pipeline,
        param_grid=parameters,
        cv=3,
        verbose=3,
        n_jobs=-1
    )
    
    return gcv
#print(cv.get_params().keys())

#pipeline = build_model()
#print(pipeline)
#pipeline.get_params()
#print(pipeline.get_params().keys())

In [13]:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokenize)),  
    ('tfidf_transformer', TfidfTransformer()),           
    ('classifier', MultiOutputClassifier(RandomForestClassifier()))  
])
print(pipeline.get_params().keys())



dict_keys(['memory', 'steps', 'vectorizer', 'tfidf_transformer', 'classifier', 'vectorizer__analyzer', 'vectorizer__binary', 'vectorizer__decode_error', 'vectorizer__dtype', 'vectorizer__encoding', 'vectorizer__input', 'vectorizer__lowercase', 'vectorizer__max_df', 'vectorizer__max_features', 'vectorizer__min_df', 'vectorizer__ngram_range', 'vectorizer__preprocessor', 'vectorizer__stop_words', 'vectorizer__strip_accents', 'vectorizer__token_pattern', 'vectorizer__tokenizer', 'vectorizer__vocabulary', 'tfidf_transformer__norm', 'tfidf_transformer__smooth_idf', 'tfidf_transformer__sublinear_tf', 'tfidf_transformer__use_idf', 'classifier__estimator__bootstrap', 'classifier__estimator__class_weight', 'classifier__estimator__criterion', 'classifier__estimator__max_depth', 'classifier__estimator__max_features', 'classifier__estimator__max_leaf_nodes', 'classifier__estimator__min_impurity_decrease', 'classifier__estimator__min_impurity_split', 'classifier__estimator__min_samples_leaf', 'class

In [14]:
### 4. Train pipeline
def train_pipeline():
    """Split the data, fit the grid search, and predict on both splits."""
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    # Cast every message to str so the vectorizer never receives non-string values
    X_train = [str(x) for x in X_train]
    X_test = [str(x) for x in X_test]
    model = build_model()
    model.fit(X_train, y_train)
    y_test_pred = model.predict(X_test)
    y_train_pred = model.predict(X_train)
    col_names = list(y.columns.values)
    print(y_test_pred)
    print(y_train_pred)
    return y_test_pred, y_train_pred, col_names, y_train, y_test

#y_test_pred, y_train_pred, col_names, y_train, y_test = train_pipeline()
#print(y_test_pred)
#print(y_train_pred)
#print(col_names)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
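
A minimal sketch of that per-column loop is shown below. The two column names and the tiny label arrays are invented for the example, and `output_dict=True` assumes a scikit-learn version that supports it (0.20 or later).

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical ground truth and predictions for two label columns.
demo_columns = ["water", "shelter"]
y_true_demo = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred_demo = np.array([[1, 0], [0, 0], [1, 1], [1, 0]])

reports = {}
for i, column in enumerate(demo_columns):
    # output_dict=True returns nested dicts keyed by class label (as strings).
    reports[column] = classification_report(
        y_true_demo[:, i], y_pred_demo[:, i], output_dict=True
    )
    # The plain-text report is convenient for reading in a notebook.
    print(column)
    print(classification_report(y_true_demo[:, i], y_pred_demo[:, i]))
```

Keeping the dictionary form alongside the printed report makes it easy to collect the per-category numbers into a DataFrame later.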

In [15]:

    
def get_evaluation_metrics(actual, predicted, column_names, average='macro'):
    """
    Calculate evaluation metrics for a multi-label machine learning model.

    Args:
        actual (array-like): Ground truth labels (2D array or equivalent structure).
        predicted (array-like): Predicted labels from the model (2D array or equivalent structure).
        column_names (list of str): List of field names corresponding to each label.

    Returns:
        pd.DataFrame: A DataFrame containing Accuracy, Precision, Recall, and F1 Score for each label.
    """

    # Ensure inputs are NumPy arrays for consistent processing
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)

    # Calculate metrics for each field and store in a list of dictionaries
    metrics = []
    for i, column in enumerate(column_names):
        metrics.append({
            "Field": column,
            "Accuracy": accuracy_score(actual[:, i], predicted[:, i]),
            "Precision": precision_score(actual[:, i], predicted[:, i], average=average),
            "Recall": recall_score(actual[:, i], predicted[:, i], average=average),
            "F1": f1_score(actual[:, i], predicted[:, i], average=average)
        })

    # Convert metrics into a DataFrame for better presentation
    metrics_df = pd.DataFrame(metrics).set_index("Field")
    return metrics_df

In [16]:
def main():
    # Train the pipeline and collect predictions for both splits
    y_test_pred, y_train_pred, col_names, y_train, y_test = train_pipeline()
    # Metrics on the training split
    train_df = get_evaluation_metrics(y_train, y_train_pred, col_names, average='macro')
    print(train_df)
    # Metrics on the test split
    test_df_first = get_evaluation_metrics(y_test, y_test_pred, col_names, average='macro')
    print(test_df_first)
    return train_df, test_df_first
    




In [17]:
main()



Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1) 
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1) 
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1) 
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1), score=0.1655950850121446, total= 8.5min
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1), score=0.170452921845978, total= 8.5min
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngr

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  9.6min finished


[[1 0 0 ..., 0 0 0]
 [1 1 0 ..., 0 0 0]
 [1 0 0 ..., 0 0 0]
 ..., 
 [1 0 0 ..., 0 0 0]
 [1 0 0 ..., 0 0 0]
 [1 0 0 ..., 0 0 0]]
[[1 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [1 0 0 ..., 0 0 0]
 [2 0 0 ..., 0 0 0]
 [1 0 0 ..., 0 0 0]]
                        Accuracy        F1  Precision    Recall
Field                                                          
related                 0.997524  0.992295   0.998346  0.986430
request                 0.997952  0.996359   0.997643  0.995085
offer                   0.999905  0.994768   0.999952  0.989691
aid_related             0.997857  0.997790   0.998039  0.997546
medical_help            0.997905  0.992867   0.998863  0.987044
medical_products        0.998190  0.990220   0.999050  0.981731
search_and_rescue       0.998809  0.988815   0.999388  0.978705
security                0.999571  0.993752   0.999782  0.987871
military                0.999048  0.992528   0.999508  0.985755
child_alone             1.000000  1.0000



(                        Accuracy        F1  Precision    Recall
 Field                                                          
 related                 0.997524  0.992295   0.998346  0.986430
 request                 0.997952  0.996359   0.997643  0.995085
 offer                   0.999905  0.994768   0.999952  0.989691
 aid_related             0.997857  0.997790   0.998039  0.997546
 medical_help            0.997905  0.992867   0.998863  0.987044
 medical_products        0.998190  0.990220   0.999050  0.981731
 search_and_rescue       0.998809  0.988815   0.999388  0.978705
 security                0.999571  0.993752   0.999782  0.987871
 military                0.999048  0.992528   0.999508  0.985755
 child_alone             1.000000  1.000000   1.000000  1.000000
 water                   0.997952  0.991288   0.998548  0.984268
 food                    0.998428  0.995992   0.998927  0.993103
 shelter                 0.997476  0.992113   0.998120  0.986282
 clothing                

In [21]:
train_df, test_df_first = main()
train_df.describe()

Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1) 
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1) 
[CV] classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1) 
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1), score=0.16716673810544364, total= 8.4min
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__ngram_range=(1, 1), score=0.16630947278182598, total= 8.5min
[CV]  classifier__estimator__min_samples_split=2, classifier__estimator__n_estimators=50, features__text_pipeline__vectorizer__

[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  9.6min finished


[[1 0 0 ..., 0 0 0]
 [1 0 0 ..., 0 0 0]
 [1 0 0 ..., 0 0 0]
 ..., 
 [1 0 0 ..., 0 0 0]
 [1 0 0 ..., 0 0 0]
 [1 0 0 ..., 0 0 0]]
[[1 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ..., 
 [1 0 0 ..., 0 0 0]
 [2 0 0 ..., 0 0 0]
 [1 0 0 ..., 0 0 0]]
                        Accuracy        F1  Precision    Recall
Field                                                          
related                 0.997428  0.989984   0.998304  0.982026
request                 0.998476  0.997290   0.998521  0.996070
offer                   0.999857  0.992111   0.999928  0.984536
aid_related             0.997762  0.997692   0.997906  0.997481
medical_help            0.997381  0.991067   0.998023  0.984342
medical_products        0.998381  0.991266   0.999150  0.983654
search_and_rescue       0.999429  0.994690   0.999706  0.989779
security                0.999190  0.988070   0.999588  0.977089
military                0.998952  0.991770   0.999459  0.984330
child_alone             1.000000  1.0000



       Accuracy        F1  Precision    Recall
count      36.0      36.0       36.0      36.0
mean   0.998718  0.992656   0.999043    0.9865
std    0.000805   0.00356   0.000721  0.007014
min    0.997285   0.98348   0.997511  0.968354
25%    0.998095  0.991402    0.99849  0.983652
50%    0.998667  0.992185   0.999181  0.985703
75%     0.99944  0.994663   0.999644  0.990416
max         1.0       1.0        1.0       1.0


In [22]:
test_df_first.median()
test_df_first.describe()

       Accuracy        F1  Precision    Recall
count      36.0      36.0       36.0      36.0
mean   0.926275  0.503714   0.556245  0.514751
std     0.09018  0.092779   0.115832  0.088676
min    0.570286  0.312744   0.399465  0.342027
25%    0.913381  0.486682    0.48938       0.5
50%    0.954286  0.493457   0.497286       0.5
75%    0.982571  0.497656   0.599072  0.504326
max         1.0       1.0        1.0       1.0


Initially I used `RandomForestClassifier`; for the new model, I am trying `MultiOutputClassifier(AdaBoostClassifier())`.


### 6. Improve your model
Use grid search to find better parameters. 
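
Grid search evaluates every combination in the parameter grid and cross-validates each one. The sketch below enumerates the combinations with the standard library only; it illustrates the counting and is not scikit-learn's implementation. The grid values and the fold count mirror the `param_grid` and `cv=3` used for the AdaBoost search in this notebook.

```python
from itertools import product

# Grid values as used for the AdaBoost search in this notebook.
param_grid_demo = {
    "clf__estimator__n_estimators": [50, 100],
    "clf__estimator__learning_rate": [0.1, 0.5],
}
n_folds = 3

# Every combination of values is one candidate configuration.
names = sorted(param_grid_demo)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid_demo[name] for name in names))
]

# Each candidate is fitted once per cross-validation fold.
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "fits")
```

Two values for each of two parameters give 4 candidates, and 3 folds give 12 fits, so adding values to the grid multiplies the training time accordingly.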

In [26]:



# Create A New Model
def build_new_model():
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ]))
          #('starting_verb', StartingVerbExtractor())
         ])),
        
    
        ('clf', MultiOutputClassifier(
            AdaBoostClassifier(
            base_estimator=DecisionTreeClassifier(max_depth=1)
            )
        ))
    ])

    # Specify parameters for grid search
    param_grid = {
        'clf__estimator__n_estimators': [50, 100],
        'clf__estimator__learning_rate': [0.1, 0.5]
    }

    # Create grid search object
    cv = GridSearchCV(
        estimator=pipeline, 
        param_grid=param_grid, 
        cv=3, 
        verbose=3
    )
    
    return cv

cv = build_new_model()
print('Print CV', cv)


Print CV GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...er='best'),
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__n_estimators': [50, 100], 'clf__estimator__learning_rate': [0.1, 0.5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)


Best parameters 
clf__estimator__min_samples_split=2, 
clf__estimator__n_estimators=50, 
tfidf__max_df=0.75, 
tfidf__ngram_range=(1, 1) 


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!
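
When comparing accuracy with precision and recall, it helps to remember how macro averaging treats rare labels. The pure-Python sketch below uses made-up labels (5 positives out of 100) and a classifier that always predicts 0. It only illustrates the arithmetic and is not output from the models in this notebook; the helper `macro_scores` is hypothetical.

```python
def macro_scores(y_true, y_pred, classes=(0, 1)):
    """Return (accuracy, macro precision, macro recall) for one label column."""
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        # Treat undefined ratios (no predictions or no samples) as 0.
        precisions.append(tp / predicted if predicted else 0.0)
        recalls.append(tp / actual if actual else 0.0)
    return accuracy, sum(precisions) / len(classes), sum(recalls) / len(classes)


# Hypothetical imbalanced column: 5 positives, 95 negatives, all predicted 0.
y_true_rare = [1] * 5 + [0] * 95
y_pred_rare = [0] * 100

acc, macro_p, macro_r = macro_scores(y_true_rare, y_pred_rare)
print(acc, macro_p, macro_r)
```

In this toy case accuracy is 0.95 while macro recall is only 0.5, which is why accuracy alone can look strong on sparse categories.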

In [27]:
def main():
    # Generate a synthetic multi-label dataset as a stand-in for the real message data
    X, y = make_multilabel_classification(n_samples=1000, n_features=10, n_classes=3, random_state=42)
    
    # Convert each row of X to a string (textual representation of the data)
    X_text = [" ".join(map(str, row)) for row in X]  # Convert numerical features to strings
    
    # Split the dataset into training and testing sets
    X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X_text, y, test_size=0.2, random_state=42)
    
    # X_train_new and X_test_new are lists of strings, which the vectorizer can consume

    # Convert the lists to NumPy arrays of strings
    X_train_new = np.array(X_train_new)
    X_test_new = np.array(X_test_new)

    # Build the grid search defined in build_new_model()
    model = build_new_model()

    # Train the model
    np.random.seed(80)
    model.fit(X_train_new, y_train_new)

    # Predict on both the training and the test split
    y_train_pred = model.predict(X_train_new)
    y_test_pred = model.predict(X_test_new)
    # Evaluate each label column; the synthetic labels are named by index
    col_names = list(range(y.shape[1]))
    train_df = get_evaluation_metrics(y_train_new, y_train_pred, col_names, average='macro')
    test_df  = get_evaluation_metrics(y_test_new, y_test_pred, col_names, average='macro')
    # Print the evaluation metrics
    print('Train Accuracy', train_df.describe(), '\n')
    print('Test Accuracy', test_df.describe(), '\n')

In [43]:
main()

Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=50 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=50, score=0.23220973782771537, total=   0.4s
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=50 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=50, score=0.25842696629213485, total=   0.4s
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=50 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s


[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=50, score=0.18796992481203006, total=   0.4s
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100, score=0.22846441947565543, total=   0.6s
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100, score=0.26591760299625467, total=   0.6s
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100 
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100, score=0.20676691729323307, total=   0.6s
[CV] clf__estimator__learning_rate=0.5, clf__estimator__n_estimators=50 
[CV]  clf__estimator__learning_rate=0.5, clf__estimator__n_estimators=50, score=0.19101123595505617, total=   0.4s
[CV] clf__estimator__learning_rate=0.5, clf__estimator__n_estimators=50 
[CV]  clf__estimator__learning_rate=0.5, clf__estimato

[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed:    7.3s finished


Train Accuracy        Accuracy        F1  Precision    Recall
count  3.000000  3.000000   3.000000  3.000000
mean   0.657917  0.577969   0.671249  0.600274
std    0.029377  0.070112   0.014096  0.047565
min    0.628750  0.521944   0.655692  0.562076
25%    0.643125  0.538656   0.665289  0.573636
50%    0.657500  0.555368   0.674886  0.585196
75%    0.672500  0.605981   0.679028  0.619373
max    0.687500  0.656593   0.683170  0.653550 

Test Accuracy        Accuracy        F1  Precision    Recall
count  3.000000  3.000000   3.000000  3.000000
mean   0.596667  0.491283   0.531503  0.528499
std    0.053463  0.093349   0.079065  0.064884
min    0.535000  0.432494   0.461443  0.479269
25%    0.580000  0.437465   0.488640  0.491736
50%    0.625000  0.442435   0.515837  0.504204
75%    0.627500  0.520677   0.566532  0.553114
max    0.630000  0.598920   0.617228  0.602024 



In [None]:
# Below: metric summaries for the new model and the old model


#New Model build_new_model() :Random Forest model

Train Accuracy        Accuracy        F1  Precision    Recall
count  3.000000  3.000000   3.000000  3.000000
mean   0.995833  0.995203   0.996970  0.993590
std    0.007217  0.008309   0.005249  0.011103
min    0.987500  0.985609   0.990909  0.980769
25%    0.993750  0.992804   0.995455  0.990385
50%    1.000000  1.000000   1.000000  1.000000
75%    1.000000  1.000000   1.000000  1.000000
max    1.000000  1.000000   1.000000  1.000000 

Test Accuracy        Accuracy        F1  Precision    Recall
count  3.000000  3.000000   3.000000  3.000000
mean   0.583333  0.556671   0.559707  0.556746
std    0.057735  0.041128   0.043404  0.039406
min    0.550000  0.520000   0.521978  0.520833
25%    0.550000  0.534436   0.535989  0.535669
50%    0.550000  0.548872   0.550000  0.550505
75%    0.600000  0.575006   0.578571  0.574703
max    0.650000  0.601140   0.607143  0.598901

#Old Model build_model() :AdaBoostClassifier 
Test
        Accuracy         F1  Precision     Recall
count  36.000000  36.000000  36.000000  36.000000
mean    0.923884   0.506563   0.536845   0.515260
std     0.093727   0.090605   0.108180   0.088072
min     0.558476   0.344810   0.374710   0.352867
25%     0.912238   0.488003   0.490330   0.499806
50%     0.954190   0.494219   0.497857   0.500000
75%     0.982238   0.497944   0.549052   0.503678
max     1.000000   1.000000   1.000000   1.000000

Train
        Accuracy         F1  Precision     Recall
count  36.000000  36.000000  36.000000  36.000000
mean    0.987800   0.930657   0.992305   0.885316
std     0.008690   0.019244   0.005914   0.034609
min     0.968711   0.904986   0.976969   0.842318
25%     0.982201   0.919542   0.989128   0.863874
50%     0.988666   0.925491   0.993500   0.875053
75%     0.994940   0.933499   0.996952   0.892159
max     1.000000   1.000000   1.000000   1.000000




### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
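
As one possible reading of the second idea, the sketch below adds a word-count feature next to TF-IDF using `FeatureUnion`. The `MessageLengthExtractor` class and the sample messages are hypothetical and are not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: whitespace-separated word count per message."""

    def fit(self, X, y=None):
        # Nothing to learn for this feature.
        return self

    def transform(self, X):
        lengths = [len(str(text).split()) for text in X]
        # scikit-learn expects a 2D array: one row per sample.
        return np.array(lengths, dtype=float).reshape(-1, 1)


demo_messages = ["we need water", "thank you", "tents and food needed urgently"]

demo_features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLengthExtractor()),
])
combined = demo_features.fit_transform(demo_messages)

# The union stacks the TF-IDF columns first, then the single length column.
vocab_size = len(dict(demo_features.transformer_list)["tfidf"].vocabulary_)
print(combined.shape, vocab_size)
```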

After several rounds of evaluation, the Random Forest model performs better: the new model using AdaBoostClassifier shows no improvement over it. The Random Forest model is therefore retained as the preferred choice.


### 9. Export your model as a pickle file
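
A minimal sketch of a pickle round trip with a fitted model is shown below. The small `LogisticRegression` pipeline, the toy training messages, and the temporary file path are stand-ins chosen for the example, not the notebook's classifier.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the tuned classifier.
demo_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression()),
])
demo_model.fit(["need water", "send water", "thank you", "thanks a lot"], [1, 1, 0, 0])

# Write the fitted model to a temporary file, then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_classifier.pkl")
with open(demo_path, "wb") as f:
    pickle.dump(demo_model, f)
with open(demo_path, "rb") as f:
    restored_model = pickle.load(f)

# The restored model should give the same prediction as the original.
sample = ["water please"]
print(restored_model.predict(sample), demo_model.predict(sample))
```

Pickle stores functions by reference, so a pipeline that uses a custom tokenizer function can only be loaded where that function is importable under the same name.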

In [28]:
# Fit the grid search before exporting; an unfitted estimator cannot predict
X, y = load_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
best_model = build_model()
best_model.fit([str(x) for x in X_train], y_train)

# Save the fitted model to a file
model_file = 'classifier.pkl'
with open(model_file, 'wb') as f:
    pickle.dump(best_model, f)
print("Model saved successfully!")
# some time later...
 


Model saved successfully!


In [None]:
# load the model from disk
#loaded_model = pickle.load(open(model_file, 'rb'))
#loaded_model
#print('successfully loaded')
#X, y = load_data()

#call pipeline
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 1)
#X_train = [str(x) for x in X_train]
#X_test = [str(x) for x in X_test]
#Train
#loaded_model.fit(X_train, y_train)
#print('successfully loaded')
#y_test_pred = loaded_model.predict(X_test)
#y_train_pred = loaded_model.predict(X_train)
#col_names = list(y.columns.values)
#train_df = get_evaluation_metrics(y_train, y_train_pred, col_names, average='macro')
#print(train_df)

### 10. Use this notebook to complete `train_classifier.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
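
One way such a script could accept the user-specified paths is sketched below with `argparse`. The argument names `database_filepath` and `model_filepath` and the helper `parse_args` are assumptions for illustration; the actual template may read its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse the database path and the output model path (hypothetical CLI)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="SQLite database produced by the ETL step")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


# Example invocation with placeholder paths; a real script would call
# parse_args() with no arguments so that sys.argv is used.
args = parse_args(["etl_pipeline.db", "classifier.pkl"])
print(args.database_filepath, args.model_filepath)
```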