# ML Pipeline 
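Follow the instructions below to build your machine learning pipeline.
### 1. Import and load
- Import Python libraries
- Load the dataset from the database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define the feature variable X and the target variable Y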

In [12]:
# import libraries
from sqlalchemy import create_engine
import re
import pickle
import nltk

import numpy as np
import pandas as pd

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, fbeta_score, make_scorer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import multioutput
from sklearn.multioutput import MultiOutputClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
# load data from database
engine = create_engine('sqlite:///DisasterData.db')
df = pd.read_sql_table('DisasterData',engine)
X = df.message.values
y = df.iloc[:,4:].values

### 2. Write a tokenization function to process the text

In [14]:
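def tokenize(text):
    """Normalize, tokenize, remove stop words, and lemmatize a message."""
    # Lowercase and replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    words = word_tokenize(text)

    # Build the stop-word set and the lemmatizer once per call, not once per word
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    tokens = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    return tokens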
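
To make the normalization and stop-word filtering steps easier to follow without downloading any NLTK data, here is a simplified sketch that uses only the standard library. The names `toy_tokenize` and `TOY_STOP_WORDS` are hypothetical and exist only for this illustration. Unlike the function above, the sketch splits on whitespace instead of calling `word_tokenize`, uses a tiny hand-written stop-word set, and does no lemmatization.

```python
import re

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook's tokenize() uses NLTK's English stop words instead.
TOY_STOP_WORDS = {"we", "the", "and", "a", "is"}


def toy_tokenize(text):
    """Lowercase, replace non-alphanumerics with spaces, split, drop stop words."""
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [word for word in cleaned.split() if word not in TOY_STOP_WORDS]


tokens = toy_tokenize("Help!! We need water & food.")
print(tokens)  # ['help', 'need', 'water', 'food']
```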

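### 3. Build a machine learning pipeline
This pipeline should take the `message` column as input and output classification results for the 36 categories in the dataset. You may find [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.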
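
To illustrate the input and output shapes involved, here is a minimal sketch on toy data. The messages and the two label columns are made up for this example, and the default scikit-learn tokenizer is used so that no NLTK data is needed. The point is only that one row of text goes in and one row with a prediction per label column comes out.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up messages; not taken from the project dataset
toy_messages = [
    "we need water and food",
    "house on fire send help",
    "need food for children",
    "fire spreading near school",
]
# Two hypothetical label columns: [food_related, fire_related]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("transformer", TfidfTransformer()),
    ("clf", MultiOutputClassifier(DecisionTreeClassifier(random_state=10))),
])
toy_pipeline.fit(toy_messages, toy_labels)

toy_predictions = toy_pipeline.predict(["please send food", "big fire downtown"])
print(toy_predictions.shape)  # one row per message, one column per label
```

With the real dataset, the same structure would yield an array with 36 columns, one per category.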

In [15]:
pipeline = Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=tokenize)),
        ('transformer', TfidfTransformer()),
        ('clf', multioutput.MultiOutputClassifier(DecisionTreeClassifier(random_state=10),n_jobs=-1))
        ])

### 4. Train the pipeline
- Split the data into training and test sets
- Train the pipeline

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state = 10)
X_train.shape, y_train.shape

((13193,), (13193, 36))

In [17]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...tion_leaf=0.0, presort=False, random_state=10,
            splitter='best'),
           n_jobs=-1))])

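### 5. Test your model
Report the f1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.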
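
The column-by-column reporting described above can be sketched as follows. The arrays and the category names are toy values invented for this example. The loop visits every column and also stores a weighted f1 score per category, which is easier to compare than the printed reports.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Hypothetical category names and toy labels, for illustration only
category_names = ["related", "request", "offer"]
toy_true = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 1]])
toy_pred = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]])

f1_by_category = {}
for i, name in enumerate(category_names):
    # One report per output column
    print(name)
    print(classification_report(toy_true[:, i], toy_pred[:, i]))
    f1_by_category[name] = f1_score(toy_true[:, i], toy_pred[:, i], average="weighted")

print(f1_by_category)
```

In this notebook the names could presumably be taken from `df.columns[4:]`, assuming those columns hold the labels, which matches how `y` is built from `df.iloc[:,4:]`.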

In [18]:
y_pred = pipeline.predict(X_test)

for i in range(10):
    print(classification_report(y_test[:,i],y_pred[:,i]) )

             precision    recall  f1-score   support

          0       0.53      0.49      0.51      3077
          1       0.85      0.85      0.85     10016
          2       0.14      0.39      0.21       100

avg / total       0.77      0.76      0.77     13193

             precision    recall  f1-score   support

          0       0.91      0.92      0.91     10922
          1       0.59      0.55      0.57      2271

avg / total       0.85      0.86      0.86     13193

             precision    recall  f1-score   support

          0       0.99      1.00      1.00     13124
          1       0.04      0.03      0.03        69

avg / total       0.99      0.99      0.99     13193

             precision    recall  f1-score   support

          0       0.75      0.77      0.76      7691
          1       0.66      0.64      0.65      5502

avg / total       0.71      0.71      0.71     13193

             precision    recall  f1-score   support

          0       0.94      0.95 

In [19]:
for i in range(10):
    print(accuracy_score(y_test[:,i],y_pred[:,i]) )

0.764875312666
0.858030773895
0.991510649587
0.711286288183
0.901690290305
0.938149018419
0.960130372167
0.96861972258
0.957780641249
1.0


### 6. Improve your model
Use grid search to find the best combination of parameters.
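
As a rough sketch of how such a search can be set up, the example below runs `GridSearchCV` on toy data. It shows two things: parameter keys follow the `step__estimator__parameter` naming pattern, and the `scoring` argument accepts a custom metric wrapped with `make_scorer`. The data, the helper `mean_column_accuracy`, and the choice of metric are assumptions made for this illustration, not part of the project requirements.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


def mean_column_accuracy(y_true, y_pred):
    """Hypothetical metric: fraction of individual labels predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Made-up messages with two hypothetical label columns: [food_related, fire_related]
toy_messages = [
    "we need water and food", "house on fire send help",
    "need food for children", "fire spreading near school",
    "food supplies running low", "forest fire close to town",
    "children need food and water", "fire trucks needed now",
]
toy_labels = np.array([[1, 0], [0, 1]] * 4)

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("transformer", TfidfTransformer()),
    ("clf", MultiOutputClassifier(DecisionTreeClassifier(random_state=10))),
])

# "clf" is the pipeline step, "estimator" the wrapped tree, then the tree parameter
param_grid = {"clf__estimator__min_samples_split": [2, 3]}

search = GridSearchCV(
    toy_pipeline,
    param_grid,
    scoring=make_scorer(mean_column_accuracy),
    cv=2,
)
search.fit(toy_messages, toy_labels)
print(search.best_params_, search.best_score_)
```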

In [20]:
pipeline.get_params()

{'memory': None,
 'steps': [('vectorizer',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7ff5ee3a7620>, vocabulary=None)),
  ('transformer',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
               max_features=None, max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, presort=False, random_state=10,
               splitter='best'),
              n_jo

In [21]:
parameters = {
             'clf__estimator__min_samples_split':[3,4]
             }

cv = GridSearchCV(pipeline, parameters)

In [22]:
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...tion_leaf=0.0, presort=False, random_state=10,
            splitter='best'),
           n_jobs=-1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__min_samples_split': [3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [23]:
cv.best_score_

0.16789206397331918
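
This score is much lower than the per-column accuracies printed earlier. A likely reason, offered as an interpretation and not something the notebook states: with `scoring=None`, `GridSearchCV` falls back to the estimator's own `score` method, which for a multi-output classifier counts a sample as correct only when all of its outputs are predicted correctly. The NumPy sketch below uses toy arrays to show how far apart the two views of accuracy can be.

```python
import numpy as np

# Toy labels and predictions for 4 samples and 3 label columns
toy_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
toy_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 0]])

# Accuracy of each label column on its own
per_column_accuracy = (toy_true == toy_pred).mean(axis=0)

# Fraction of samples where every label is correct at the same time
exact_match_accuracy = np.all(toy_true == toy_pred, axis=1).mean()

print(per_column_accuracy)   # [1.   0.75 0.5 ]
print(exact_match_accuracy)  # 0.25
```

With 36 label columns, a single wrong label is enough to mark the whole sample as wrong, so the exact-match figure can be low even when most columns look accurate.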

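### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, development process, and pipelines, there is no minimum performance metric required. However, fine-tuning your model to improve accuracy, precision, and recall can make your project stand out, especially on your resume.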

In [24]:
y_pred = cv.predict(X_test)

for i in range(10):
    print(classification_report(y_test[:,i],y_pred[:,i]) )
for i in range(10):
    print(accuracy_score(y_test[:,i],y_pred[:,i]) )

             precision    recall  f1-score   support

          0       0.53      0.49      0.51      3077
          1       0.85      0.86      0.85     10016
          2       0.15      0.42      0.23       100

avg / total       0.77      0.77      0.77     13193

             precision    recall  f1-score   support

          0       0.91      0.92      0.91     10922
          1       0.59      0.54      0.56      2271

avg / total       0.85      0.86      0.85     13193

             precision    recall  f1-score   support

          0       0.99      1.00      1.00     13124
          1       0.05      0.03      0.04        69

avg / total       0.99      0.99      0.99     13193

             precision    recall  f1-score   support

          0       0.74      0.76      0.75      7691
          1       0.66      0.63      0.64      5502

avg / total       0.71      0.71      0.71     13193

             precision    recall  f1-score   support

          0       0.94      0.95 

In [25]:
cv.best_score_

0.16789206397331918

### 8. Try improving your model further. Here are a few ideas:
* Try other machine learning algorithms
* Try adding other features besides TF-IDF
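
One possible way to add a feature next to TF-IDF is `FeatureUnion`, which stacks the outputs of several transformers side by side. The sketch below is an illustration under assumptions: the `MessageLength` transformer is a hypothetical example feature (character count of each message), and the default scikit-learn tokenizer is used so that no NLTK data is needed.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: the number of characters in each message."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)


features = FeatureUnion([
    ("text", Pipeline([
        ("vectorizer", CountVectorizer()),
        ("transformer", TfidfTransformer()),
    ])),
    ("length", MessageLength()),
])

toy_messages = ["we need water", "road blocked by flood near the bridge"]
feature_matrix = features.fit_transform(toy_messages)

# 10 TF-IDF columns (one per distinct word) plus 1 length column
print(feature_matrix.shape)  # (2, 11)
```

The `features` object could then take the place of the `vectorizer` and `transformer` steps in a pipeline, followed by the classifier step.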

In [26]:
pipeline2 = Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=tokenize)),
        ('transformer', TfidfTransformer()),
        ('clf', multioutput.MultiOutputClassifier(RandomForestClassifier(),n_jobs=-1))
      ])

pipeline2.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...ob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=-1))])

In [28]:
y_pred2 = pipeline2.predict(X_test)

for i in range(10):
    print(classification_report(y_test[:,i], y_pred2[:,i]))
for i in range(10):
    print(accuracy_score(y_test[:,i],y_pred2[:,i]))

             precision    recall  f1-score   support

          0       0.61      0.44      0.51      3077
          1       0.84      0.91      0.87     10016
          2       0.39      0.26      0.31       100

avg / total       0.78      0.80      0.78     13193

             precision    recall  f1-score   support

          0       0.89      0.97      0.93     10922
          1       0.77      0.44      0.56      2271

avg / total       0.87      0.88      0.87     13193

             precision    recall  f1-score   support

          0       0.99      1.00      1.00     13124
          1       1.00      0.03      0.06        69

avg / total       0.99      0.99      0.99     13193

             precision    recall  f1-score   support

          0       0.76      0.85      0.80      7691
          1       0.75      0.62      0.68      5502

avg / total       0.75      0.75      0.75     13193

             precision    recall  f1-score   support

          0       0.93      0.99 

### 9. Export your model as a pickle file

In [30]:
# Use a context manager so the file is flushed and closed after writing
with open('model.sav', 'wb') as f:
    pickle.dump(pipeline, f)
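
A quick way to check an export is to load the file back and compare. The sketch below does this round trip with a plain dictionary as a stand-in for the fitted pipeline and writes to a temporary directory, both of which are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted pipeline
stand_in_model = {"name": "toy-model", "min_samples_split": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.sav")

    # Write the object, closing the file when done
    with open(model_path, "wb") as f:
        pickle.dump(stand_in_model, f)

    # Read it back from disk
    with open(model_path, "rb") as f:
        restored_model = pickle.load(f)

print(restored_model == stand_in_model)  # True
```

Note that pickle stores functions by reference, so a pipeline that uses a custom `tokenize` function can only be loaded in a program where that function can be imported under the same name.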

### 10. Use this notebook to complete `train.py`
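Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.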