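# ML Pipeline
Follow the instructions below to build your machine learning pipeline.
### 1. Import and load
- Import Python libraries
- Load the dataset from the database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define the feature variable X and the target variable Y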
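
As a rough illustration of these three steps, the sketch below writes a made-up table to a throwaway in-memory SQLite database and reads it back. The table name `messages`, the toy rows, and the column layout are assumptions for the example only, not the project's real data.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database, used here only for illustration.
engine = create_engine("sqlite://")

# Hypothetical toy data: one text column and two binary label columns.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["need water", "storm coming", "all fine"],
    "request": [1, 0, 0],
    "weather_related": [0, 1, 0],
})
toy.to_sql("messages", engine, index=False)

# Load the whole table back into a DataFrame.
df = pd.read_sql_table("messages", engine)

X = df["message"]     # feature: the raw text
Y = df.iloc[:, 2:]    # targets: every column after the text column

print(X.shape, Y.shape)
print(list(Y.columns))
```

Selecting the targets by position with `iloc` is convenient, but it relies on the column order in the table staying the same.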

In [69]:
# import libraries
# TODO 1
import pandas as pd
import numpy as np
import re
import pickle
#NLTK libraries
import nltk
nltk.download(['punkt', 'wordnet', 'stopwords', 'averaged_perceptron_tagger'])
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sqlalchemy import create_engine

from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.multioutput import MultiOutputClassifier


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [70]:
# load data from database
# TODO 2
engine = create_engine('sqlite:///Disaster_Response.db')
df = pd.read_sql_table('DisasterResponsePipeline_table',engine)
X = df['message']
Y = df.iloc[:, 4:]
Y.head()

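| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |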


### 2. Write a tokenization function to start processing the text

In [5]:
X[0]

'Weather update - a cold front from Cuba that could pass over Haiti'

In [71]:
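def tokenize(text):
    # Normalize: lowercase and replace non-alphanumeric characters with spaces
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    tokens = word_tokenize(text)

    # Build the stop-word set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    # Drop stop words and lemmatize the remaining tokens
    clean_tokens = []
    for tok in tokens:
        if tok not in stop_words:
            clean_tok = lemmatizer.lemmatize(tok)
            clean_tokens.append(clean_tok)

    return clean_tokens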


In [56]:
text = X[0]
# tokenize(text)
print(text)
tokenize(text)

Weather update - a cold front from Cuba that could pass over Haiti


['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti']
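
To see the normalization and stop-word filtering in isolation, here is a small sketch that uses only the standard library. `str.split()` stands in for `word_tokenize`, and the hand-written `STOP_WORDS` set is a hypothetical substitute for NLTK's English list. Neither is what the notebook itself uses.

```python
import re


def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())


# Hypothetical, hand-written stop-word set standing in for NLTK's English list.
STOP_WORDS = {"a", "from", "that", "over"}

text = "Weather update - a cold front from Cuba that could pass over Haiti"

# str.split() stands in for word_tokenize; it splits on runs of whitespace.
tokens = [tok for tok in normalize(text).split() if tok not in STOP_WORDS]
print(tokens)
```

No lemmatization happens in this sketch, so `pass` stays `pass` here.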

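### 3. Build a machine learning pipeline
This pipeline should take the `message` column as input and output classification results for the 36 categories in the dataset. You may find [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.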
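
A minimal sketch of what `MultiOutputClassifier` does, assuming random toy data with numeric features in place of text. The array names and sizes are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.RandomState(0)

# Toy data: 40 samples, 5 numeric features, 3 binary target columns.
X_toy = rng.rand(40, 5)
Y_toy = (rng.rand(40, 3) > 0.5).astype(int)

# One RandomForestClassifier is cloned and fitted per target column.
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X_toy, Y_toy)

pred = clf.predict(X_toy[:4])
print(pred.shape)             # one row per sample, one column per target
print(len(clf.estimators_))   # number of fitted per-target classifiers
```

Because each target gets its own independent classifier, training time grows roughly in proportion to the number of target columns.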

In [72]:
# Build the machine learning pipeline
pipeline = Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer()),
                ('clf', MultiOutputClassifier(RandomForestClassifier()))
            ])

### 4. Train the pipeline
- Split the data into training and test sets
- Train the pipeline
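
A short sketch of the split, assuming a hypothetical ten-row dataset. It shows that `train_test_split` keeps features and targets aligned row by row, and that a fixed `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 messages with two label columns.
X_toy = pd.Series(["msg {}".format(i) for i in range(10)])
Y_toy = pd.DataFrame({
    "a": [i % 2 for i in range(10)],
    "b": [int(i % 3 == 0) for i in range(10)],
})

X_tr, X_te, Y_tr, Y_te = train_test_split(X_toy, Y_toy, test_size=0.3, random_state=42)

# Features and targets are split with the same row indices.
print(list(X_te.index) == list(Y_te.index))
print(len(X_tr), len(X_te))
```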

In [59]:
# Split the data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=42)

# Fit the pipeline on the training data
pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

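### 5. Test the model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating over the columns and calling sklearn's `classification_report` on each one.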
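
The per-column loop can also collect the numbers instead of only printing them. The sketch below assumes scikit-learn 0.22 or newer, where `classification_report` accepts `zero_division`. The labels and predictions are made up.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions for two categories.
y_true = pd.DataFrame({"request": [1, 0, 1, 0], "offer": [0, 0, 1, 0]})
y_pred = pd.DataFrame({"request": [1, 0, 0, 0], "offer": [0, 0, 0, 0]})

f1_positive = {}
for column in y_true.columns:
    # zero_division=0 reports 0.0 (without a warning) when a class is never predicted.
    report = classification_report(
        y_true[column], y_pred[column], output_dict=True, zero_division=0
    )
    # Keys of the report dict are the class labels as strings.
    f1_positive[column] = report["1"]["f1-score"]

print(f1_positive)
```

Keeping the positive-class F1 per category in a dict makes it easy to spot categories, like the toy `offer` column here, where the model never predicts the positive class.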

In [60]:
# Predict on the test data with the current model
Y_pred_test = pipeline.predict(X_test)


In [45]:
Y_pred_test.shape


(5244, 36)

In [46]:
Y_test.shape

(5244, 36)

In [62]:
# classification_report builds a text report of the main classification metrics,
# showing precision, recall, and F1 score for each class
Y_pred_test = pd.DataFrame(Y_pred_test, columns = Y_test.columns)
for column in Y_test.columns:
    print('Category: {}'.format(column))
    print(classification_report(Y_test[column], Y_pred_test[column]))

Category: related
             precision    recall  f1-score   support

          0       0.64      0.48      0.55      1581
          1       0.84      0.91      0.87      4925
          2       0.27      0.38      0.32        48

avg / total       0.79      0.80      0.79      6554

Category: request
             precision    recall  f1-score   support

          0       0.89      0.98      0.93      5435
          1       0.79      0.44      0.57      1119

avg / total       0.88      0.88      0.87      6554

Category: offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6527
          1       0.00      0.00      0.00        27

avg / total       0.99      1.00      0.99      6554

Category: aid_related
             precision    recall  f1-score   support

          0       0.76      0.85      0.80      3868
          1       0.74      0.61      0.67      2686

avg / total       0.75      0.75      0.75      6554

Category: medical_help
             precisi

  'precision', 'predicted', average, warn_for)


### 6. Improve the model
Use grid search to find the best combination of parameters.
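
Grid-search keys follow the pattern `<step name>__<parameter>`, and parameters of a classifier wrapped in `MultiOutputClassifier` are reached through its `estimator` parameter. The sketch below shows this naming on a made-up eight-message corpus. The corpus, labels, and grid values are assumptions chosen so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two binary label columns.
texts = ["need water", "need food", "storm coming", "heavy storm",
         "water please", "food please", "storm warning", "big storm"]
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1],
                   [1, 0], [1, 0], [0, 1], [0, 1]])

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# Keys follow <step name>__<parameter>; parameters of the wrapped forest
# are reached through MultiOutputClassifier's `estimator` parameter.
param_grid = {
    "tfidf__norm": ["l1", "l2"],
    "clf__estimator__n_estimators": [5, 10],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=2)
search.fit(texts, labels)
print(sorted(search.best_params_))
```

The number of fits is the number of parameter combinations times the number of folds, so each extra value in the grid multiplies the run time.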

In [73]:
# Set the grid search parameters
parameters = {'tfidf__norm': ['l1', 'l2'],
            'tfidf__sublinear_tf': [True, False]}

cv = GridSearchCV(pipeline, param_grid=parameters,cv=5, verbose=3, n_jobs=-1)

In [74]:
# Train the model with grid search
cv.fit(X_train, Y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] tfidf__norm=l1, tfidf__sublinear_tf=True ........................
[CV]  tfidf__norm=l1, tfidf__sublinear_tf=True, score=0.22679888126112382, total= 2.4min
[CV] tfidf__norm=l1, tfidf__sublinear_tf=True ........................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:  3.8min remaining:    0.0s


[CV]  tfidf__norm=l1, tfidf__sublinear_tf=True, score=0.24764810577167556, total= 2.5min
[CV] tfidf__norm=l1, tfidf__sublinear_tf=True ........................


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:  7.6min remaining:    0.0s


[CV]  tfidf__norm=l1, tfidf__sublinear_tf=True, score=0.2461851475076297, total= 2.4min
[CV] tfidf__norm=l1, tfidf__sublinear_tf=True ........................
[CV]  tfidf__norm=l1, tfidf__sublinear_tf=True, score=0.2461851475076297, total= 2.4min
[CV] tfidf__norm=l1, tfidf__sublinear_tf=True ........................
[CV]  tfidf__norm=l1, tfidf__sublinear_tf=True, score=0.23143438453713122, total= 2.4min
[CV] tfidf__norm=l1, tfidf__sublinear_tf=False .......................
[CV]  tfidf__norm=l1, tfidf__sublinear_tf=False, score=0.23290109331299264, total= 2.4min
[CV] tfidf__norm=l1, tfidf__sublinear_tf=False .......................
[CV]  tfidf__norm=l1, tfidf__sublinear_tf=False, score=0.24383422323925757, total= 2.4min
[CV] tfidf__norm=l1, tfidf__sublinear_tf=False .......................
[CV]  tfidf__norm=l1, tfidf__sublinear_tf=False, score=0.23728382502543235, total= 2.4min
[CV] tfidf__norm=l1, tfidf__sublinear_tf=False .......................
[CV]  tfidf__norm=l1, tfidf__sublinear_

[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed: 75.7min finished


GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'tfidf__norm': ['l1', 'l2'], 'tfidf__sublinear_tf': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=3)

In [75]:
# Show the best parameter combination
cv.best_params_

{'tfidf__norm': 'l2', 'tfidf__sublinear_tf': True}

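### 7. Test the model
Report the accuracy, precision, and recall of the tuned model.

Because this project focuses on code quality, development process, and pipeline techniques, there is no minimum requirement for model performance metrics. However, tuning the model to improve accuracy, precision, and recall can make your project stand out, especially on your résumé.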

In [76]:
# Predict with the tuned model
Y_pred_test2 = cv.predict(X_test)

In [77]:
# classification_report builds a text report of the main classification metrics,
# showing precision, recall, and F1 score for each class
Y_pred_test2 = pd.DataFrame(Y_pred_test2, columns = Y_test.columns)
for column in Y_test.columns:
    print('Category: {}'.format(column))
    print(classification_report(Y_test[column], Y_pred_test2[column]))

Category: related
             precision    recall  f1-score   support

          0       0.63      0.46      0.53      1581
          1       0.84      0.91      0.87      4925
          2       0.20      0.35      0.25        48

avg / total       0.78      0.79      0.78      6554

Category: request
             precision    recall  f1-score   support

          0       0.90      0.98      0.94      5435
          1       0.84      0.45      0.59      1119

avg / total       0.89      0.89      0.88      6554

Category: offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6527
          1       0.00      0.00      0.00        27

avg / total       0.99      1.00      0.99      6554

Category: aid_related
             precision    recall  f1-score   support

          0       0.76      0.86      0.81      3868
          1       0.75      0.61      0.67      2686

avg / total       0.76      0.76      0.75      6554

Category: medical_help
             precisi

  'precision', 'predicted', average, warn_for)


In [79]:
# Check the overall accuracy of the model
Y_test = Y_test.reset_index(drop=True)
overall_accuracy = (Y_pred_test2 == Y_test).mean().mean()

print('\n\nAverage overall accuracy {0:.4f}% \n'.format(overall_accuracy*100))



Average overall accuracy 94.5699% 
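
The figure above averages correctness over every (message, category) cell, so rare categories such as `offer` (27 positive test messages, F1 of 0.00) barely affect it. The sketch below uses made-up labels to contrast this per-cell accuracy with the stricter exact-match accuracy, in which a message counts only when all of its categories are right.

```python
import pandas as pd

# Hypothetical labels: the second category is positive only once and never predicted.
y_true = pd.DataFrame({"related": [1, 1, 0, 1], "offer": [0, 0, 0, 1]})
y_pred = pd.DataFrame({"related": [1, 0, 0, 1], "offer": [0, 0, 0, 0]})

# Per-cell accuracy: mean over rows per column, then mean over columns.
per_cell = (y_pred == y_true).mean().mean()

# Exact-match accuracy: a row counts only when every category matches.
exact_match = (y_pred == y_true).all(axis=1).mean()

print(per_cell, exact_match)
```

The cross-validation scores of roughly 0.23 to 0.25 in the grid-search log are plausibly of the exact-match kind, since `GridSearchCV` with `scoring=None` falls back to the estimator's own `score` method. That would explain why they look so much lower than the per-cell average.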



### 8. Try improving the model further, for example:
* Try other machine learning algorithms
* Try features other than TF-IDF
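
One way to add a feature besides TF-IDF is to combine feature extractors with `FeatureUnion`. The sketch below is only an illustration: `TextLengthExtractor` is a hypothetical transformer invented for this example, and message length is not claimed to be a useful feature for this dataset.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one feature holding each message's character count."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([len(text) for text in X]).reshape(-1, 1)


features = FeatureUnion([
    ("text_pipeline", Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ])),
    ("text_length", TextLengthExtractor()),
])

messages = ["need water", "storm coming soon"]
matrix = features.fit_transform(messages)

# TF-IDF columns (one per vocabulary word) plus one length column.
print(matrix.shape)
```

The `features` object can take the place of the `vect` and `tfidf` steps in a pipeline, followed by the classifier step as before.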

In [80]:
# Try an AdaBoost classifier to see how the model performs
pipeline_AdaBoost = Pipeline([('vect', CountVectorizer(tokenizer = tokenize)),
                     ('tfidf', TfidfTransformer()),
                     ('clf_2', MultiOutputClassifier(AdaBoostClassifier()))
                     ])

pipeline_AdaBoost.fit(X_train, Y_train)

Pipeline(memory=None,
     steps=[('cvect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        stri...mator=None,
          learning_rate=1.0, n_estimators=50, random_state=None),
           n_jobs=1))])

In [81]:
Y_ada_test = pipeline_AdaBoost.predict(X_test)


In [82]:
# classification_report builds a text report of the main classification metrics,
# showing precision, recall, and F1 score for each class
Y_ada_test = pd.DataFrame(Y_ada_test, columns = Y_test.columns)
for column in Y_test.columns:
    print('Category: {}'.format(column))
    print(classification_report(Y_test[column], Y_ada_test[column]))

Category: related
             precision    recall  f1-score   support

          0       0.66      0.12      0.20      1581
          1       0.77      0.98      0.86      4925
          2       0.50      0.17      0.25        48

avg / total       0.74      0.77      0.70      6554

Category: request
             precision    recall  f1-score   support

          0       0.91      0.96      0.94      5435
          1       0.76      0.54      0.63      1119

avg / total       0.88      0.89      0.88      6554

Category: offer
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      6527
          1       0.00      0.00      0.00        27

avg / total       0.99      0.99      0.99      6554

Category: aid_related
             precision    recall  f1-score   support

          0       0.77      0.86      0.81      3868
          1       0.76      0.63      0.69      2686

avg / total       0.77      0.77      0.76      6554

Category: medical_help
             precisi

In [83]:

overall_accuracy = (Y_ada_test == Y_test).mean().mean()

print('\n\nAverage overall accuracy {0:.4f}% \n'.format(overall_accuracy*100))



Average overall accuracy 94.8123% 



### 9. Export the model as a pickle file

In [84]:
with open("classifier_my.pickle", 'wb') as pickle_file:
    pickle.dump(cv, pickle_file)
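
A quick way to check an exported model is to load it back and compare predictions. The sketch below does this with a small stand-in pipeline and a temporary file. The toy training data and the file name `model.pickle` are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the fitted search object saved above.
model = Pipeline([("vect", CountVectorizer()), ("clf", LogisticRegression())])
model.fit(["need water", "storm coming", "need food", "big storm"], [1, 0, 1, 0])

# Write the model to a temporary file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "model.pickle")
with open(path, "wb") as pickle_file:
    pickle.dump(model, pickle_file)
with open(path, "rb") as pickle_file:
    restored = pickle.load(pickle_file)

sample = ["need water now"]
print((restored.predict(sample) == model.predict(sample)).all())
```

Pickle stores functions by reference, not by value, so a pipeline that uses a custom function such as `tokenize` can only be loaded where that function is importable under the same module path.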

### 10. Use this notebook to complete `train.py`
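Use the template file provided in the Resources folder to write a script that runs the steps above, creates a database, and exports a model based on a new dataset specified by the user.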
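
The template file itself is not shown in this notebook, so the skeleton below is only a hypothetical layout for such a script. The argument names `database_filepath` and `model_filepath` and the function names are assumptions, and the numbered comments mark where the notebook's steps would go.

```python
import argparse


def parse_args(argv=None):
    """Read the two paths a training script of this kind would typically need."""
    parser = argparse.ArgumentParser(description="Train and export a classifier.")
    parser.add_argument("database_filepath", help="SQLite database to read from")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder steps, mirroring the notebook sections:
    #   1. load X and Y from args.database_filepath
    #   2. build the pipeline and the grid search
    #   3. fit on the training split and report metrics on the test split
    #   4. pickle the fitted model to args.model_filepath
    return args.database_filepath, args.model_filepath


# Example invocation with explicit arguments instead of the command line.
print(main(["Disaster_Response.db", "classifier_my.pickle"]))
```

Passing `argv` explicitly keeps the entry point easy to call from tests. When `argv` is left as `None`, `argparse` reads the real command-line arguments.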