# ML Pipeline 
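Follow the instructions below to build your machine learning pipeline.
### Import and load
- Import the Python libraries
- Load the dataset from the database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define the feature variable X and the target variable Y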

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import sqlite3
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tendays\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tendays\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tendays\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# load data from database
engine = create_engine('sqlite:///DisasterData.db')
df = pd.read_sql_table(table_name='DisasterData', con=engine, index_col='id')
# features: the raw message text
X = df.message.values
# targets: every column from position 4 onward
y = df.iloc[:, 4:].values
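
As a minimal sketch of the same loading pattern, the snippet below writes a tiny made-up table to an in-memory SQLite database and reads it back with `read_sql_table`. The table name `ToyMessages`, its columns, and its rows are assumptions for illustration only; they are not the contents of `DisasterData.db`.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical toy table standing in for the real message table.
engine = create_engine("sqlite:///:memory:")
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["we need water", "roads are blocked", "hello there"],
    "genre": ["direct", "news", "social"],
    "water": [1, 0, 0],
    "transport": [0, 1, 0],
})
toy.to_sql("ToyMessages", engine, index=False)

# Read the table back, using the id column as the index.
df_toy = pd.read_sql_table("ToyMessages", con=engine, index_col="id")

# Features are the message text; targets are the label columns,
# selected by position as in the cell above.
X_toy = df_toy["message"].values
Y_toy = df_toy.iloc[:, 2:].values
print(X_toy.shape, Y_toy.shape)
```

Selecting targets by position is compact but depends on the column order, so it is worth printing `df.columns` once to confirm which columns end up in `y`.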

### Write a tokenization function to start processing the text

In [3]:
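def tokenize(text):
    # lowercase, then replace every non-alphanumeric character with a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    words = word_tokenize(text)
    # build the stop-word set and the lemmatizer once per call, not once per word
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return tokens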
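
To see what the first step of `tokenize` does on its own, here is a small sketch that uses only the standard library. The helper name `normalize` and the sample sentence are assumptions for illustration; the NLTK tokenization, stop-word removal, and lemmatization steps are left out on purpose.

```python
import re

def normalize(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

sample = "Help!! We need water, food & shelter."
cleaned = normalize(sample)

# Splitting on whitespace shows the words that survive the cleanup.
print(cleaned.split())
```

Punctuation becomes whitespace rather than being deleted, so words that were joined by a symbol are still separated afterwards.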


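### Build the machine learning pipeline
The pipeline should take the `message` column as input and output classification results for the 36 categories in the dataset. [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) is useful for predicting multiple target variables.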

In [8]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokenize)),
    ('transformer', TfidfTransformer()),
    ('clf', MultiOutputClassifier(DecisionTreeClassifier(random_state=10), n_jobs=1)),
])
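
The following sketch shows, on made-up numeric data, what `MultiOutputClassifier` does with a two-column target: it fits one copy of the base estimator per column. The arrays `X_toy` and `Y_toy` are assumptions chosen so that each target column simply copies one feature.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data: target column 0 copies feature 0, target column 1 copies feature 1.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=10))
clf.fit(X_toy, Y_toy)

# One fitted tree per target column.
print(len(clf.estimators_))
# Predictions have one column per target.
print(clf.predict(X_toy))
```

Because each column gets its own independent model, correlations between categories are not used; that is a property of this wrapper, not of the data.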

### Train the pipeline
- Split the data into training and test sets
- Train the pipeline

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state = 10)
X_train.shape, y_train.shape

((13193,), (13193, 35))
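
A quick sanity check you can run on toy data: `train_test_split` shuffles the features and the targets together, so each row of `y` stays paired with its message. The arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(10)
# Two target columns derived from X, so the pairing is easy to verify.
y_toy = np.column_stack([X_toy % 2, X_toy * 10])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=10
)

print(X_tr.shape, y_tr.shape)
# Each target row still belongs to its feature value.
print(np.array_equal(y_tr[:, 1], X_tr * 10))
```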

In [10]:
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...ction_leaf=0.0, presort=False, random_state=10,
            splitter='best'),
           n_jobs=1))])

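### Test the model
Report the f1 score, precision, and recall for each output category of the dataset. You can iterate over the columns and call sklearn's `classification_report` on each one.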
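
A minimal sketch of that loop on made-up data, labelling each report with its category. The names in `category_names` and both arrays are assumptions; with the real data the names could come from the DataFrame's column labels.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

category_names = ["water", "food"]  # hypothetical category names
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_hat = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])

f1_by_category = {}
for i, name in enumerate(category_names):
    # Full text report for this column.
    print(name)
    print(classification_report(y_true[:, i], y_hat[:, i]))
    # f1 of the positive class, kept for later comparison.
    f1_by_category[name] = f1_score(y_true[:, i], y_hat[:, i])

print(f1_by_category)
```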

In [15]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

y_pred = pipeline.predict(X_test)

# Only the first 10 output columns are reported here;
# use range(y_test.shape[1]) to cover every category.
for i in range(10):
    print(classification_report(y_test[:, i], y_pred[:, i]))

             precision    recall  f1-score   support

          0       0.91      0.92      0.91     10922
          1       0.59      0.55      0.57      2271

avg / total       0.85      0.86      0.86     13193

             precision    recall  f1-score   support

          0       0.99      1.00      1.00     13124
          1       0.04      0.03      0.03        69

avg / total       0.99      0.99      0.99     13193

             precision    recall  f1-score   support

          0       0.75      0.77      0.76      7691
          1       0.66      0.64      0.65      5502

avg / total       0.71      0.71      0.71     13193

             precision    recall  f1-score   support

          0       0.94      0.95      0.95     12152
          1       0.37      0.34      0.35      1041

avg / total       0.90      0.90      0.90     13193

             precision    recall  f1-score   support

          0       0.97      0.97      0.97     12501
          1       0.40      0.37 

In [16]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

for i in range(10):
    print(accuracy_score(y_test[:,i],y_pred[:,i]) )

0.8580307738952475
0.9915106495869022
0.7112862881831274
0.901690290305465
0.9381490184188584
0.9601303721670583
0.9686197225801562
0.9577806412491473
1.0
0.9570984613052377


### Improve the model
Use grid search to find the best combination of parameters.

In [23]:
from sklearn.model_selection import GridSearchCV

parameters = {
             'clf__estimator__min_samples_split':[3,4]
             }

cv = GridSearchCV(pipeline, parameters)
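
The grid key `clf__estimator__min_samples_split` follows scikit-learn's nested naming convention: pipeline step name, then the wrapper's `estimator` attribute, then the tree's parameter. As a sketch, you can list the valid keys with `get_params()`; the pipeline below is a stand-in built with default tokenization, not the fitted one above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("transformer", TfidfTransformer()),
    ("clf", MultiOutputClassifier(DecisionTreeClassifier(random_state=10))),
])

# Every name returned here is a valid key for a GridSearchCV parameter grid.
param_names = sorted(toy_pipeline.get_params().keys())

# Parameters of the decision tree wrapped by MultiOutputClassifier.
tree_params = [name for name in param_names if name.startswith("clf__estimator__")]
print(tree_params)
```

The same listing also exposes vectorizer settings such as `vectorizer__ngram_range`, which could be added to the grid if training time allows.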

In [24]:
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...ction_leaf=0.0, presort=False, random_state=10,
            splitter='best'),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__min_samples_split': [3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [25]:
cv.best_score_

0.25642386113848253
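
This `best_score_` is much lower than the per-column accuracies printed earlier. A likely explanation: with `scoring=None`, `GridSearchCV` falls back to the estimator's own `score` method, and for `MultiOutputClassifier` that counts a sample as correct only when every output column is predicted correctly. The sketch below uses made-up arrays to show how the two numbers can differ.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_hat = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])

# Accuracy computed separately for each output column.
per_column = [
    accuracy_score(y_true[:, i], y_hat[:, i]) for i in range(y_true.shape[1])
]

# Fraction of rows where all columns are correct at once.
exact_match = np.mean(np.all(y_true == y_hat, axis=1))

print(per_column)
print(exact_match)
# accuracy_score on the full 2-D indicator arrays gives the same exact-match value.
print(accuracy_score(y_true, y_hat))
```

With 35 columns, a single wrong label anywhere in a row makes the whole row count as an error, so an exact-match score far below the per-column figures is expected.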

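### Test the model
Print the accuracy, precision, and recall of the tuned model.

Because this project focuses on code quality, process, and pipelines, there is no minimum requirement for model performance. Still, tuning the model to improve accuracy, precision, and recall can make your project stand out, especially on your résumé.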
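
To compare a tuned model with the baseline, it can help to collect the three metrics in one table instead of reading many text reports. The helper `summarize`, the category names, and the arrays below are assumptions for illustration, and `zero_division=0` assumes scikit-learn 0.22 or later.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

def summarize(y_true, y_hat, names):
    """Return one row of accuracy, precision, and recall per output column."""
    rows = []
    for i, name in enumerate(names):
        rows.append({
            "category": name,
            "accuracy": accuracy_score(y_true[:, i], y_hat[:, i]),
            # zero_division=0 reports 0 when the model never predicts class 1.
            "precision": precision_score(y_true[:, i], y_hat[:, i], zero_division=0),
            "recall": recall_score(y_true[:, i], y_hat[:, i], zero_division=0),
        })
    return pd.DataFrame(rows).set_index("category")

y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 0]])
y_hat = np.array([[1, 0], [0, 0], [0, 0], [0, 0]])

table = summarize(y_true, y_hat, ["water", "food"])
print(table)
```

In this toy table the `food` column has 75% accuracy but zero precision and recall, the same pattern as the rare categories in the reports in this notebook: accuracy alone hides a model that never predicts the positive class.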

In [26]:
y_pred = cv.predict(X_test)

for i in range(10):
    print(classification_report(y_test[:,i],y_pred[:,i]) )
for i in range(10):
    print(accuracy_score(y_test[:,i],y_pred[:,i]) )

             precision    recall  f1-score   support

          0       0.91      0.92      0.91     10922
          1       0.58      0.54      0.56      2271

avg / total       0.85      0.85      0.85     13193

             precision    recall  f1-score   support

          0       0.99      1.00      1.00     13124
          1       0.05      0.03      0.04        69

avg / total       0.99      0.99      0.99     13193

             precision    recall  f1-score   support

          0       0.75      0.76      0.75      7691
          1       0.66      0.64      0.65      5502

avg / total       0.71      0.71      0.71     13193

             precision    recall  f1-score   support

          0       0.94      0.95      0.95     12152
          1       0.36      0.34      0.35      1041

avg / total       0.90      0.90      0.90     13193

             precision    recall  f1-score   support

          0       0.97      0.97      0.97     12501
          1       0.43      0.37 

### Keep improving the model, for example:
* Try other machine learning algorithms
* Try features other than TF-IDF
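
For the second idea, one possible approach is to combine TF-IDF with a hand-written feature through `FeatureUnion`. The sketch below is an assumption, not part of the original notebook: `MessageLength` is a hypothetical transformer, and `TfidfVectorizer` is used as a shorthand for `CountVectorizer` followed by `TfidfTransformer`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class MessageLength(TransformerMixin, BaseEstimator):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # One column holding the length of every message.
        return np.array([[len(text)] for text in X], dtype=float)

features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLength()),
])

messages = ["we need water", "roads are blocked"]
combined = features.fit_transform(messages)

# TF-IDF columns first, then the single length column.
print(combined.shape)
print(combined.toarray()[:, -1])
```

A `FeatureUnion` like this could replace the `vectorizer` and `transformer` steps of the pipeline, with the classifier step left unchanged.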

In [34]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

pipeline2 = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokenize)),
    ('transformer', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(), n_jobs=-1)),
])

pipeline2.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...ob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=-1))])

In [35]:
y_pred2 = pipeline2.predict(X_test)

for i in range(10):
    print(classification_report(y_test[:,i], y_pred2[:,i]))
for i in range(10):
    print(accuracy_score(y_test[:,i],y_pred2[:,i]))

             precision    recall  f1-score   support

          0       0.83      0.99      0.90     10922
          1       0.38      0.02      0.04      2271

avg / total       0.75      0.83      0.76     13193

             precision    recall  f1-score   support

          0       0.99      1.00      1.00     13124
          1       0.00      0.00      0.00        69

avg / total       0.99      0.99      0.99     13193

             precision    recall  f1-score   support

          0       0.61      0.91      0.73      7691
          1       0.58      0.17      0.26      5502

avg / total       0.59      0.60      0.53     13193

             precision    recall  f1-score   support

          0       0.92      1.00      0.96     12152
          1       0.12      0.01      0.01      1041

avg / total       0.86      0.92      0.88     13193

             precision    recall  f1-score   support

          0       0.95      1.00      0.97     12501
          1       0.18      0.01 



### Export the model as a pickle file

In [42]:
import pickle

# save the model in Python's pickle format
# the file pipeline.pkl is written to the current directory
with open('pipeline.pkl', 'wb') as fw:
    pickle.dump(pipeline2, fw)
# load the model back
with open('pipeline.pkl', 'rb') as fr:
    new_pipeline2 = pickle.load(fr)
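
A simple way to check an export is to confirm that the reloaded model predicts exactly what the original does. The sketch below does this on a made-up toy pipeline written to a temporary directory; the texts, labels, and file name are assumptions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

texts = ["need water", "need food", "sunny day", "nice weather"]
labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("clf", DecisionTreeClassifier(random_state=10)),
]).fit(texts, labels)

# Write the fitted pipeline to disk and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    with open(path, "wb") as fw:
        pickle.dump(toy_pipeline, fw)
    with open(path, "rb") as fr:
        restored = pickle.load(fr)

# The restored pipeline should reproduce the original predictions.
same = np.array_equal(toy_pipeline.predict(texts), restored.predict(texts))
print(same)
```

One caveat for the real pipeline: pickle stores a custom function such as `tokenize` by reference, so the script that loads `pipeline.pkl` must be able to import or define a function with that name.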

### Use this notebook to complete `train.py`
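Use the template file in the Resources folder to write a script that runs the steps above, creates a database, and exports a model based on a new dataset specified by the user.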
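
As a rough starting point, the command-line part of such a script could look like the sketch below. It is not the template from the Resources folder: the function name `parse_args` and the argument names `database_path` and `model_path` are assumptions, and the training steps are only outlined in comments.

```python
import argparse

def parse_args(argv=None):
    """Read the database location and the output model location."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_path", help="SQLite file holding the messages")
    parser.add_argument("model_path", help="where to write the pickled pipeline")
    return parser.parse_args(argv)

# Example call with explicit arguments instead of the real command line.
args = parse_args(["DisasterData.db", "pipeline.pkl"])
print(args.database_path, args.model_path)

# The rest of the script would reuse the notebook cells:
#   1. load X and y from args.database_path
#   2. build and fit the pipeline
#   3. pickle the fitted pipeline to args.model_path
```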