
# ML Pipeline Preparation
Follow the steps below to build your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_query`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html)
- Define feature and target variables X and Y

In [None]:
# import NLTK and download the tokenizer, lemmatizer, and stopword resources
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords'])


In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [None]:
# import statements
import pandas as pd
from sqlalchemy import create_engine
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

In [None]:
# load data from database
engine = create_engine('sqlite:///etlProject.db')
df = pd.read_sql_query('select * from cleanData', engine)

In [None]:
df[['message','genre']].head()

In [None]:
df.head()

In [None]:
X = df['message'].values
y = df.drop(['id','message','original','genre'], axis=1)

In [None]:
y.columns

In [None]:
for column in y.columns:
    print(column,y[column].unique())

In [None]:
y[y["related"] == 2]

### 2. Write a tokenization function to process your text data

In [None]:
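def tokenize(text):
    """Split text into lowercased, lemmatized tokens with English stopwords removed."""
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    # build the stopword set once per call instead of once per token
    stop_words = set(stopwords.words("english"))

    clean_tokens = []
    for tok in tokens:
        # lowercase before lemmatizing: the WordNet lemmatizer is case-sensitive
        clean_tok = lemmatizer.lemmatize(tok.lower().strip())
        if clean_tok not in stop_words:
            clean_tokens.append(clean_tok)

    return clean_tokens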

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [None]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [None]:
import numpy as np

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
X_train.shape

In [None]:
y_train.values.shape

In [None]:
np.unique(y_test.values.flatten())

In [None]:
pipeline.fit(X_train, y_train.values)

### 5. Test your model
Report the accuracy, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 

In [None]:
y_pred = pipeline.predict(X_test)

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test.values, y_pred))
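
A single `classification_report` call over the whole target matrix may not work if any target column holds more than two distinct values. One possible workaround, sketched here on made-up data, is to report each column separately. The helper `report_per_column` and the toy arrays are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import classification_report


def report_per_column(y_true, y_pred, column_names):
    """Return {column_name: text report}, one report per target column."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    reports = {}
    for i, name in enumerate(column_names):
        # y_true comes first, y_pred second; zero_division=0 avoids
        # warnings for labels that are never predicted
        reports[name] = classification_report(
            y_true[:, i], y_pred[:, i], zero_division=0
        )
    return reports


# Toy data (hypothetical): two target columns, four samples
toy_true = np.array([[1, 0], [0, 0], [2, 1], [1, 1]])
toy_pred = np.array([[1, 0], [0, 1], [1, 1], [1, 1]])

toy_reports = report_per_column(toy_true, toy_pred, ["related", "request"])
for name, text in toy_reports.items():
    print(name)
    print(text)
```

In the notebook, the same helper could be called as `report_per_column(y_test.values, y_pred, y_test.columns)`.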

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
parameters = {
        'vect__ngram_range': ((1, 1), (1, 2)),
        'clf__estimator__min_samples_split': [2, 3, 4],
    }

cv = GridSearchCV(pipeline, param_grid=parameters)

In [None]:
cv.fit(X_train, y_train.values)
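
After fitting, the search object exposes the winning parameter combination. The sketch below runs the same kind of search on a tiny made-up dataset so that it finishes quickly. The messages, targets, `cv=2`, and the small forest size are assumptions chosen for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy messages and two binary target columns (hypothetical data)
toy_messages = np.array([
    "we need water and food",
    "the road is blocked by debris",
    "please send water to the camp",
    "bridge and road are damaged",
    "food supplies are running low",
    "trees are blocking the road",
    "children need clean water",
    "the main road is flooded",
])
toy_targets = np.array([
    [1, 0], [0, 1], [1, 0], [0, 1],
    [1, 0], [0, 1], [1, 0], [0, 1],
])

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])

toy_parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__min_samples_split": [2, 3],
}

# cv=2 keeps the toy search small; a real run would likely use more folds
toy_search = GridSearchCV(toy_pipeline, param_grid=toy_parameters, cv=2)
toy_search.fit(toy_messages, toy_targets)

print(toy_search.best_params_)  # best parameter combination found
print(toy_search.best_score_)   # mean cross-validated score for it
```

The same attributes, `cv.best_params_` and `cv.best_score_`, are available on the notebook's `cv` object once `cv.fit` has finished.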

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [None]:
new_y_pred = cv.predict(X_test)

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test.values, new_y_pred))

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
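
As one illustration of the second idea, an extra feature can be combined with the TF-IDF matrix using `FeatureUnion`. The `MessageLength` transformer below is a hypothetical example feature (character count per message), not something this notebook prescribes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: the character length of each message."""

    def fit(self, X, y=None):
        # nothing to learn; fit exists so the class works inside a pipeline
        return self

    def transform(self, X):
        # one column holding the length of each text
        return np.array([[len(text)] for text in X], dtype=float)


toy_features = FeatureUnion([
    ("text", Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ])),
    ("length", MessageLength()),
])

toy_messages = ["we need water", "road is blocked near the bridge"]
toy_matrix = toy_features.fit_transform(toy_messages)

# TF-IDF columns followed by one extra length column
print(toy_matrix.shape)
```

To use this in the notebook, the `FeatureUnion` would take the place of the `vect` and `tfidf` steps, with the classifier step added after it.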

### 9. Export your model as a pickle file
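
A minimal sketch of the export step, assuming the standard-library `pickle` module and a small stand-in model. The file name `classifier.pkl`, the temporary directory, and the toy data are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Small stand-in model trained on made-up data
toy_messages = np.array([
    "we need water",
    "the road is blocked",
    "send food and water",
    "bridge is damaged",
])
toy_targets = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_model.fit(toy_messages, toy_targets)

# Write the fitted model to disk (hypothetical file name)
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

# Load it back and check that predictions are unchanged
with open(model_path, "rb") as f:
    restored_model = pickle.load(f)

original_pred = toy_model.predict(toy_messages)
restored_pred = restored_model.predict(toy_messages)
print((original_pred == restored_pred).all())
```

In the notebook, the object to dump would be the fitted `cv` or `pipeline`. Because pickle stores functions by reference, the custom `tokenize` function must be importable under the same name wherever the file is loaded.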

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
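
The template file is not shown here, so the following is only a rough sketch of how such a script might read the user-specified paths. The argument names `database_filepath` and `model_filepath`, and the function names, are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two hypothetical command-line arguments of train.py."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath",
                        help="path to the SQLite database with the cleaned data")
    parser.add_argument("model_filepath",
                        help="path where the pickled model will be written")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps (load data, build pipeline, train, evaluate, export)
    # would be called here; this sketch only echoes the parsed paths.
    print(f"Would load data from {args.database_filepath}")
    print(f"Would save the model to {args.model_filepath}")
    return args


# Example invocation with hypothetical paths
example_args = main(["etlProject.db", "classifier.pkl"])
```

In a real script, `main()` would be called with no arguments under an `if __name__ == "__main__":` guard so that the paths come from the command line.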