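#### Hyperparameters: we refer to a model's configuration settings collectively as its hyperparameters, for example the value of K in the k-nearest neighbors algorithm or the kernel function in a support vector machine.
#### We can tune combinations of hyperparameters with a heuristic search.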

### Grid search

In [1]:
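##### Because the space of possible hyperparameter values is unbounded, a search can only find a better combination among those it tries, never a guaranteed optimum.
##### In practice, we rely on grid search (GridSearch) to search the space of hyperparameter combinations by brute force: each combination is plugged into the learning function to produce a new model,
##### and, so that the models can be compared, each one is evaluated by cross-validation on the same sets of training and development (validation) splits.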
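
The procedure described above can be sketched without any library. The following is a minimal illustration, not the scikit-learn implementation: it assumes a toy one-dimensional k-nearest-neighbors classifier, a made-up two-class dataset, and a hypothetical grid over `k` and `weighted`. Every candidate is scored on the same three folds, and the candidate with the highest mean accuracy wins.

```python
import itertools
import random

def knn_predict(train, x, k, weighted):
    """Predict the label of x from 1-D training points given as (value, label)."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = {}
    for value, label in neighbours:
        # Optionally weight each vote by inverse distance.
        w = 1.0 / (abs(value - x) + 1e-9) if weighted else 1.0
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)

def cross_val_accuracy(data, k, weighted, n_folds=3):
    """Mean accuracy over n_folds; the folds are identical for every candidate."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        held_out = folds[i]
        train = [p for j, f in enumerate(folds) if j != i for p in f]
        correct = sum(knn_predict(train, x, k, weighted) == y for x, y in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / n_folds

# Made-up data: two overlapping Gaussian classes (an assumption for illustration).
random.seed(0)
data = [(random.gauss(0.0, 1.0), 0) for _ in range(30)]
data += [(random.gauss(3.0, 1.0), 1) for _ in range(30)]
random.shuffle(data)

# Hypothetical hyperparameter grid: 4 values of k times 2 weighting modes.
grid = {'k': [1, 3, 5, 7], 'weighted': [False, True]}
candidates = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# Brute force: evaluate every combination and keep the best mean score.
results = [(cross_val_accuracy(data, **params), params) for params in candidates]
best_score, best_params = max(results, key=lambda r: r[0])
print(len(candidates), best_params, best_score)
```

The result is only the best of the eight combinations that were tried, which is why the text speaks of a better solution rather than an optimal one.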

#### Single-threaded grid search over the hyperparameter combinations of a support vector machine model for text classification

In [1]:
from sklearn.datasets import fetch_20newsgroups
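# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed from scikit-learn
from sklearn.model_selection import train_test_split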
import numpy as np
news = fetch_20newsgroups(subset='all')
x_trian,x_test,y_train,y_test = train_test_split(news.data[:3000],news.target[:3000],test_size=0.25,random_state=123)



In [7]:
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

In [8]:
clf = Pipeline([('vect',TfidfVectorizer(stop_words='english',analyzer='word')),('svc',SVC())])

In [9]:
parameters = {'svc__gamma':np.logspace(-2,1,4),'svc__C':np.logspace(-1,1,3)}

In [10]:
np.logspace(-2,1,4)

array([  0.01,   0.1 ,   1.  ,  10.  ])
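
The keys of the `parameters` dictionary use scikit-learn's `step__parameter` naming, where the part before the double underscore is the name given to the pipeline step. The sketch below assumes a recent scikit-learn, where `ParameterGrid` is importable from `sklearn.model_selection`. It shows that `svc__gamma` reaches the `gamma` attribute of the `SVC` step, and it counts the candidates that the grid produces.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([('vect', TfidfVectorizer(stop_words='english', analyzer='word')),
                ('svc', SVC())])
parameters = {'svc__gamma': np.logspace(-2, 1, 4), 'svc__C': np.logspace(-1, 1, 3)}

# 'svc__gamma' addresses the parameter `gamma` of the step named 'svc'.
clf.set_params(svc__gamma=0.1, svc__C=10.0)
gamma_after = clf.named_steps['svc'].gamma

# Expand the grid: 4 gamma values times 3 C values.
candidates = list(ParameterGrid(parameters))
n_candidates = len(candidates)
n_fits = n_candidates * 3  # one fit per candidate per fold with cv=3
print(gamma_after, n_candidates, n_fits)
```

The two counts correspond to the "12 candidates" and "36 fits" reported in the search log.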

In [11]:
# Import the grid search module

In [12]:
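# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed from scikit-learn
from sklearn.model_selection import GridSearchCV
# 12 parameter combinations over the pipeline; cv=3 selects 3-fold cross-validation; refit=True retrains the best combination on the full training set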
gs = GridSearchCV(clf,parameters,verbose=2,refit=True,cv=3)

In [13]:
# Run the single-threaded grid search (fit returns the fitted search object, not a timing)
_ = gs.fit(x_trian,y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] ............................ svc__C=0.1, svc__gamma=0.01 -   8.1s
[CV] svc__C=0.1, svc__gamma=0.01 .....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.1s remaining:    0.0s


[CV] ............................ svc__C=0.1, svc__gamma=0.01 -   7.4s
[CV] svc__C=0.1, svc__gamma=0.01 .....................................
[CV] ............................ svc__C=0.1, svc__gamma=0.01 -   7.7s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=0.1 -   8.1s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=0.1 -   7.3s
[CV] svc__C=0.1, svc__gamma=0.1 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=0.1 -   7.9s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=1.0 -   7.9s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] ............................. svc__C=0.1, svc__gamma=1.0 -   7.2s
[CV] svc__C=0.1, svc__gamma=1.0 ......................................
[CV] .

[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  4.3min finished


In [14]:
# Print the accuracy of the best model on the test set
print(gs.score(x_test,y_test))

0.785333333333


In [15]:
gs.best_params_

{'svc__C': 10.0, 'svc__gamma': 0.10000000000000001}

In [16]:
gs.best_estimator_.score(x_test,y_test)

0.78533333333333333
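
The best parameters are chosen by the mean cross-validation score on the training data, not by the test score. As a rough sketch of how to inspect that, the example below runs the same 4 x 3 grid on a small synthetic dataset from `make_classification`, an assumption made only to keep it fast, and uses a plain `SVC` instead of the text pipeline. It assumes a recent scikit-learn, where the fitted search exposes `cv_results_`, `best_index_` and `best_score_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (not the 20 newsgroups corpus).
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

grid = {'gamma': np.logspace(-2, 1, 4), 'C': np.logspace(-1, 1, 3)}
search = GridSearchCV(SVC(), grid, cv=3, refit=True)
search.fit(X, y)

# One mean cross-validation score per candidate.
mean_scores = search.cv_results_['mean_test_score']
best_index = int(np.argmax(mean_scores))

print(search.best_params_)
print(search.best_score_, mean_scores[best_index])
```

With `refit=True`, calling `score` on the search object delegates to `best_estimator_`, which is why the two test accuracies in the notebook above are identical.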

### Parallel search

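#### Grid search over parameter combinations is very time-consuming. Fortunately, the candidate models are independent of one another during cross-validation, so we can make full use of multi-core processors
#### or even distributed computing resources to run the search in parallel.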
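
To illustrate why this independence allows parallel execution, here is a small standard-library sketch. It is not how scikit-learn schedules work internally: `fit_and_score` is a hypothetical stand-in that returns an arbitrary deterministic number instead of training a model, and a thread pool is used only to show that the jobs can run in any order with the same outcome.

```python
from concurrent.futures import ThreadPoolExecutor
import itertools
import math

gammas = [0.01, 0.1, 1.0, 10.0]  # same values as np.logspace(-2, 1, 4)
Cs = [0.1, 1.0, 10.0]            # same values as np.logspace(-1, 1, 3)
folds = [0, 1, 2]                # 3-fold cross-validation

def fit_and_score(task):
    """Hypothetical stand-in for fitting on one fold and scoring the held-out part."""
    gamma, C, fold = task
    # Arbitrary deterministic formula, not a real accuracy.
    return math.exp(-abs(math.log10(gamma))) * C / (C + 1.0) - 0.01 * fold

# 4 * 3 * 3 = 36 jobs, none of which needs the result of another.
tasks = list(itertools.product(gammas, Cs, folds))

serial_scores = [fit_and_score(t) for t in tasks]
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel_scores = list(pool.map(fit_and_score, tasks))

print(len(tasks), serial_scores == parallel_scores)
```

Pure-Python threads do not speed up CPU-bound work, so the sketch shows only the independence of the jobs, not the performance gain.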

In [17]:
from sklearn.datasets import fetch_20newsgroups
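# train_test_split lives in sklearn.model_selection; sklearn.cross_validation was removed from scikit-learn
from sklearn.model_selection import train_test_split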
import numpy as np
news = fetch_20newsgroups(subset='all')
x_trian,x_test,y_train,y_test = train_test_split(news.data[:3000],news.target[:3000],test_size=0.25,random_state=123)
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
clf = Pipeline([('vect',TfidfVectorizer(stop_words='english',analyzer='word')),('svc',SVC())])
parameters = {'svc__gamma':np.logspace(-2,1,4),'svc__C':np.logspace(-1,1,3)}
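# GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed from scikit-learn
from sklearn.model_selection import GridSearchCV
# 12 parameter combinations over the pipeline; cv=3 selects 3-fold cross-validation; refit=True retrains the best combination; n_jobs=-1 uses all of the machine's CPU cores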
gs = GridSearchCV(clf,parameters,verbose=2,refit=True,cv=3,n_jobs=-1)
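# Run the grid search in parallel (fit returns the fitted search object, not a timing)
_ = gs.fit(x_trian,y_train)
# Print the accuracy of the best model on the test set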
print(gs.score(x_test,y_test))

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:  2.4min finished


0.785333333333
