## Forward Selection by AIC

ref: https://blog.csdn.net/weixin_44835596/article/details/89763300

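When a multiple linear regression model has many candidate variables, some of them not significant, the AIC criterion can be combined with stepwise regression to select variables.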
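As a rough illustration of the criterion itself, the sketch below computes AIC = 2k - 2 ln(L) for least-squares fits with Gaussian errors, using only the standard library. The helper names (`gaussian_aic`, `fit_line`) and the toy data are made up for this example. Here k is assumed to count the regression coefficients including the intercept; some references also count the error variance, which shifts every AIC by the same constant and does not change the ranking.

```python
import math

def gaussian_aic(residuals, n_coefs):
    """AIC = 2k - 2 ln(L) for a least-squares fit with Gaussian errors."""
    n = len(residuals)
    ssr = sum(r * r for r in residuals)
    # Maximized Gaussian log-likelihood, with the variance set to its MLE ssr / n
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(ssr / n) + 1)
    return 2 * n_coefs - 2 * log_lik

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope

# Toy data (hypothetical): y is roughly 2 * x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Model 1: intercept only (one coefficient)
mean_y = sum(y) / len(y)
aic_null = gaussian_aic([b - mean_y for b in y], n_coefs=1)

# Model 2: intercept plus slope (two coefficients)
b0, b1 = fit_line(x, y)
aic_line = gaussian_aic([b - (b0 + b1 * a) for a, b in zip(x, y)], n_coefs=2)

print("intercept only:", aic_null)
print("with x:", aic_line)
```

The model with `x` pays a penalty of 2 for the extra coefficient but fits far better, so its AIC is lower; forward selection keeps adding variables for as long as this trade-off stays favorable.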

In [7]:
import warnings
warnings.filterwarnings("ignore")

In [8]:
# Define the forward stepwise selection function (ols is imported in the next cell)
def forward_select(data,target):
    variate=set(data.columns)  # collect the column names into a set
    variate.remove(target)  # drop the name of the response variable
    selected=[]
    current_score,best_new_score=float('inf'),float('inf')  # both scores start at infinity (lower AIC is better)
    # Loop over the remaining candidate variables
    while variate:
        aic_with_variate=[]
        for candidate in variate:  # try each remaining predictor in turn
            formula="{}~{}".format(target,"+".join(selected+[candidate]))  # build the model formula
            aic=ols(formula=formula,data=data).fit().aic  # fit OLS and read off its AIC
            aic_with_variate.append((aic,candidate))  # record the (AIC, candidate) pair
        aic_with_variate.sort(reverse=True)  # sort by AIC in descending order
        best_new_score,best_candidate=aic_with_variate.pop()
        # pop() removes and returns the last item, i.e. the pair with the
        # smallest AIC: this round's best score and best candidate
        if current_score>best_new_score:  # adding the candidate lowers the AIC
            variate.remove(best_candidate)  # take it out of the candidate pool for later rounds
            selected.append(best_candidate)  # add it to the model
            current_score=best_new_score  # update the current best score
            print("aic is {},continuing!".format(current_score))  # report the new lowest AIC
        else:
            print("for selection over!")
            break
    formula="{}~{}".format(target,"+".join(selected))  # formula of the final model
    print("final formula is {}".format(formula))
    model=ols(formula=formula,data=data).fit()
    return(model)

In [9]:
import pandas as pd
import statsmodels.api as sm
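from statsmodels.formula.api import ols # formula interface for OLS

# import the data set
# note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn import datasets
boston = datasets.load_boston()     # returns a dict-like object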
X = boston.data
y = boston.target
features = boston.feature_names
boston_data = pd.DataFrame(X,columns=features)
boston_data["Price"] = y

# apply forward selection
forward_select(data=boston_data,target="Price")

aic is 3286.974956900157,continuing!
aic is 3171.5423142992013,continuing!
aic is 3114.0972674193326,continuing!
aic is 3097.359044862759,continuing!
aic is 3069.438633167217,continuing!
aic is 3057.9390497191152,continuing!
aic is 3048.438382711162,continuing!
aic is 3042.274993098419,continuing!
aic is 3040.154562175143,continuing!
aic is 3032.0687017003256,continuing!
aic is 3021.7263878250615,continuing!
for selection over!
final formula is Price~LSTAT+RM+PTRATIO+DIS+NOX+CHAS+B+ZN+CRIM+RAD+TAX


<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7fd8b2ae3940>

## Ridge Regression

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ridge_regression.html?highlight=rid#sklearn.linear_model.ridge_regression

In [10]:
from sklearn import linear_model
reg_rid = linear_model.Ridge(alpha=.5)
reg_rid.fit(X,y)
reg_rid.score(X,y)

0.739957023371629

## Lasso 

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html?highlight=lasso#sklearn.linear_model.Lasso

In [11]:
from sklearn import linear_model
reg_lasso = linear_model.Lasso(alpha = 0.5)
reg_lasso.fit(X,y)
reg_lasso.score(X,y)

0.7140164719858566
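To make the difference between the two penalties concrete, here is a minimal sketch for the simplest possible case: one centered feature and no intercept. It assumes the objectives given in the scikit-learn documentation, `||y - Xw||^2 + alpha * w^2` for Ridge and `(1 / (2n)) * ||y - Xw||^2 + alpha * |w|` for Lasso, for which closed-form solutions exist. The function names and data are hypothetical.

```python
def ridge_coef_1d(x, y, alpha):
    """Minimizer of sum((y - w*x)^2) + alpha * w^2."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)

def lasso_coef_1d(x, y, alpha):
    """Minimizer of sum((y - w*x)^2) / (2n) + alpha * |w| (soft thresholding)."""
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y)) / n
    sxx = sum(a * a for a in x) / n
    if sxy > alpha:
        return (sxy - alpha) / sxx
    if sxy < -alpha:
        return (sxy + alpha) / sxx
    return 0.0  # the penalty outweighs the correlation: coefficient is exactly zero

# Toy data (hypothetical); x is already centered
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.0, -0.4, 0.1, 0.6, 0.9]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, ridge_coef_1d(x, y, alpha), lasso_coef_1d(x, y, alpha))
```

In this toy case the ridge coefficient shrinks smoothly toward zero as `alpha` grows but never reaches it, while the lasso coefficient becomes exactly zero once `alpha` is large enough, which is why Lasso can also act as a variable selector.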

## Tuning hyperparameters

### Parameters

- internal to the model 

- can be estimated from the data

- required by the model when making predictions

- often not set manually

- e.g. weights in neural network, support vectors in SVM, coefficients in regression

### Hyperparameters

- external to the model

- cannot be estimated from the data

- used in the process that estimates the parameters

- often specified by the practitioner
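The distinction can be sketched with a tiny gradient-descent fit of `y = w * x`. In this made-up example the slope `w` is a parameter (estimated from the data), while `learning_rate` and `n_steps` are hyperparameters (chosen by hand before fitting). The function name and data are hypothetical.

```python
def fit_slope(x, y, learning_rate, n_steps):
    """Estimate the parameter w in y = w * x by gradient descent on the MSE."""
    w = 0.0  # parameter: learned from the data
    n = len(x)
    for _ in range(n_steps):
        # Gradient of mean((w*x - y)^2) with respect to w
        grad = 2.0 * sum(a * (w * a - b) for a, b in zip(x, y)) / n
        w -= learning_rate * grad
    return w

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # generated with a true slope of 2

# Same data, different hyperparameters, different estimated parameter
w_good = fit_slope(x, y, learning_rate=0.05, n_steps=50)
w_slow = fit_slope(x, y, learning_rate=0.001, n_steps=50)
print(w_good, w_slow)
```

With the larger learning rate the estimate reaches the true slope within the step budget; with the smaller one it is still far from it, which is the kind of difference hyperparameter tuning is meant to find.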

#### GridSearch CV

ref: https://towardsdatascience.com/grid-search-for-hyperparameter-tuning-9f63945e8fec

- loop through every combination of **predefined** hyperparameter values and fit the estimator on the training folds for each one

- select the combination with the best cross-validated score among the listed candidates

- the number of cross-validation folds used for each combination can be specified
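The steps above can be sketched without any library. This is only an illustration of the idea, not how `GridSearchCV` is implemented: the model is a one-feature ridge regression, the folds are assigned by index, and the score is mean squared error (lower is better), whereas scikit-learn scorers follow a higher-is-better convention. All names and data are hypothetical.

```python
from itertools import product

def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge slope for one feature without intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)

def mse(w, x, y):
    return sum((w * a - b) ** 2 for a, b in zip(x, y)) / len(x)

def k_fold_indices(n, k):
    """Yield (train, test) index lists; sample i goes to fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

def grid_search(x, y, param_grid, k=3):
    names = list(param_grid)
    best_score, best_params = float("inf"), None
    # Every combination of the predefined values is tried exactly once
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, test in k_fold_indices(len(x), k):
            w = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], **params)
            fold_scores.append(mse(w, [x[i] for i in test], [y[i] for i in test]))
        score = sum(fold_scores) / len(fold_scores)
        if score < best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Noise-free toy data, so the unpenalized fit should win
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * a for a in x]
best_params, best_score = grid_search(x, y, {"alpha": [0.0, 1.0, 10.0, 100.0]})
print(best_params, best_score)
```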

API-GridSearchCV: https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html?highlight=gridsearchcv



In [12]:
# First, evaluate an untuned SVR as a baseline:
import numpy as np
from sklearn.svm import SVR     # support vector regressor
from sklearn.pipeline import make_pipeline   # pipelines simplify the workflow
from sklearn.preprocessing import StandardScaler # SVR is distance-based, so standardize the features
from sklearn.model_selection import GridSearchCV  # grid search for tuning
from sklearn.model_selection import cross_val_score # K-fold cross-validation

pipe_SVR = make_pipeline(StandardScaler(),SVR())
# make_pipeline is shorthand for Pipeline; the steps are named automatically
score1 = cross_val_score(estimator=pipe_SVR, 
                            X = X,
                            y = y,
                            scoring = 'r2',
                            cv = 10)       # 10-fold cross-validation
print("CV R^2: %.3f +/- %.3f" % ((np.mean(score1)),np.std(score1)))

# Next, tune the SVR with grid search:
from sklearn.pipeline import Pipeline
pipe_svr = Pipeline([("StandardScaler",StandardScaler()),
                        ("svr",SVR())])
# Pipeline() sequentially applies a list of transforms and a final estimator

param_range = [0.0001,0.001,0.01,0.1,1.0,10.0,100.0,1000.0]
param_grid = [{"svr__C":param_range,"svr__kernel":["linear"]},  # note: __ is two underscores; a single one raises an error
            {"svr__C":param_range,"svr__gamma":param_range,"svr__kernel":["rbf"]}]
gs = GridSearchCV(estimator=pipe_svr,
                  param_grid = param_grid,
                  scoring = 'r2',
                  cv = 10)       # 10-fold cross-validation
gs = gs.fit(X,y)
print("Best grid search score:",gs.best_score_)
print("Best grid search parameters:\n",gs.best_params_)
print("Best grid search estimator:\n",gs.best_estimator_)

CV R^2: 0.187 +/- 0.649
Best grid search score: 0.6096834373642859
Best grid search parameters:
 {'svr__C': 1000.0, 'svr__gamma': 0.001, 'svr__kernel': 'rbf'}
Best grid search estimator:
 Pipeline(memory=None,
     steps=[('StandardScaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svr', SVR(C=1000.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma=0.001,
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False))])


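**Pipeline()**: https://iaml.it/blog/optimizing-sklearn-pipelines

- chains a sequence of transforms and a final estimator into one workflow, so that they are fitted and applied together

**param_grid**: a dictionary (or a list of dictionaries) that maps each parameter name to the candidate values to try; for a pipeline, each name is written as the step name, two underscores, then the parameter name, e.g. `svr__C`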
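To see what the grid in the cell above amounts to, it can be expanded by hand with the standard library. This sketch only counts candidates and splits a parameter name; `expand` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [
    {"svr__C": param_range, "svr__kernel": ["linear"]},
    {"svr__C": param_range, "svr__gamma": param_range, "svr__kernel": ["rbf"]},
]

def expand(grid):
    """List every parameter combination described by a list of sub-grids."""
    candidates = []
    for sub_grid in grid:
        names = sorted(sub_grid)
        for values in product(*(sub_grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates

candidates = expand(param_grid)
n_folds = 10
print(len(candidates))            # number of parameter combinations
print(len(candidates) * n_folds)  # cross-validation fits, before the final refit

# "<step name>__<parameter name>": the double underscore separates the two parts
step_name, _, param_name = "svr__C".partition("__")
print(step_name, param_name)
```

The linear sub-grid contributes 8 combinations and the RBF sub-grid 8 x 8 = 64, so each added hyperparameter multiplies the cost of a grid search.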

#### RandomizedSearch CV

ref: https://jamesrledoux.com/code/randomized_parameter_search

- take a fixed number of random draws from predetermined hyperparameter distributions

- its advantage over grid search is that it searches over distributions of parameter values rather than a fixed list of candidate values for each hyperparameter, so the number of fits is set by the sampling budget instead of the grid size
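A minimal sketch of the same idea with the standard library only: draw the hyperparameter from a distribution a fixed number of times and keep the draw with the best cross-validated score. The one-feature ridge model, the log-uniform distribution for `alpha`, and all names and data are assumptions made for illustration, not how `RandomizedSearchCV` is implemented.

```python
import random

def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge slope for one feature without intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)

def cv_mse(x, y, alpha, k=3):
    """Mean squared error averaged over k folds (sample i goes to fold i % k)."""
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(x)) if i % k != fold]
        test = [i for i in range(len(x)) if i % k == fold]
        w = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], alpha)
        fold_errors.append(sum((w * x[i] - y[i]) ** 2 for i in test) / len(test))
    return sum(fold_errors) / k

def random_search(x, y, n_iter, seed=0):
    rng = random.Random(seed)  # seeded so that the run is reproducible
    history = []
    for _ in range(n_iter):
        alpha = 10 ** rng.uniform(-3, 2)  # log-uniform draw between 1e-3 and 1e2
        history.append((cv_mse(x, y, alpha), alpha))
    best_score, best_alpha = min(history)  # lowest error wins
    return best_alpha, best_score, history

# Toy data (hypothetical): y is roughly 2 * x
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [-6.2, -3.9, -2.1, 0.1, 1.8, 4.2, 5.9, 8.1, 9.8]
best_alpha, best_score, history = random_search(x, y, n_iter=10)
print(best_alpha, best_score)
```

The cost is fixed by `n_iter` regardless of how wide the distribution is, but the result depends on the seed and on whether the distribution covers the useful range.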

In [14]:
# Next, tune the SVR with randomized search:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform  # uniform distribution for sampling parameters
pipe_svr = Pipeline([("StandardScaler",StandardScaler()),
                     ("svr",SVR())])
distributions = dict(svr__C=uniform(loc=1.0, scale=4),    # continuous distribution on [1, 5]
                     svr__kernel=["linear","rbf"],        # discrete set of choices
                     svr__gamma=uniform(loc=0, scale=4))  # uniform on [0, 4]; ignored by the linear kernel

rs = RandomizedSearchCV(estimator=pipe_svr,
                        param_distributions = distributions,
                        scoring = 'r2',
                        cv = 10)       # 10-fold cross-validation; n_iter defaults to 10 draws
rs = rs.fit(X,y)
print("Best random search score:",rs.best_score_)
print("Best random search parameters:\n",rs.best_params_)
print("Best random search estimator:\n",rs.best_estimator_)

Best random search score: 0.3057686337827243
Best random search parameters:
 {'svr__C': 1.0789861205029299, 'svr__gamma': 3.7221397918916743, 'svr__kernel': 'linear'}
Best random search estimator:
 Pipeline(memory=None,
     steps=[('StandardScaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svr', SVR(C=1.0789861205029299, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
  gamma=3.7221397918916743, kernel='linear', max_iter=-1, shrinking=True,
  tol=0.001, verbose=False))])
