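Cross-validation is mainly used in modelling applications such as PCR and PLS regression. From a given set of modelling samples, most are used to build the model and a small part is held out. The newly built model then predicts the held-out samples, the prediction errors on them are computed, and their sum of squares is recorded.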
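A minimal sketch of this procedure, assuming synthetic data and an ordinary least-squares fit through NumPy: the helper `kfold_indices`, the 5-fold setting, and the data are illustrative assumptions, not part of the analysis in this notebook. It accumulates the sum of squared prediction errors over the held-out parts, a quantity often called PRESS.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, test_idx

# Synthetic regression data (an assumption for illustration only)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

press = 0.0   # running sum of squared prediction errors
tested = []   # indices that have been held out so far
for train_idx, test_idx in kfold_indices(len(y_demo), n_splits=5):
    # Add an intercept column and fit least squares on the larger part
    A_train = np.column_stack([np.ones(len(train_idx)), X_demo[train_idx]])
    A_test = np.column_stack([np.ones(len(test_idx)), X_demo[test_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y_demo[train_idx], rcond=None)
    # Predict the held-out part and accumulate its squared errors
    residuals = y_demo[test_idx] - A_test @ coef
    press += float(np.sum(residuals ** 2))
    tested.extend(test_idx.tolist())

print("PRESS: {:.4f}".format(press))
```

Dividing `press` by the number of samples gives a cross-validated mean squared error, which is comparable across datasets of different sizes.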
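The basic idea of cross-validation is to split the original dataset into groups: one part serves as the training set (which here usually comprises both a training set and a validation set) and is used to train the model, while the other part serves as the test set and is used to evaluate the model.
- Training set: used to train the model;
- Validation set: used to select and configure the model's parameters;
- Test set: used to assess the model's generalization ability.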
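The three roles listed above can be illustrated with a simple index split. This is only a sketch: the helper `three_way_split` and the 60/20/20 proportions are assumptions chosen for illustration, and the sample count 1030 matches the dataset loaded below.

```python
import random

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=2021):
    """Return disjoint index lists for the training, validation and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # shuffle reproducibly
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = indices[:n_test]                 # only used for the final evaluation
    val_idx = indices[n_test:n_test + n_val]    # used to choose model settings
    train_idx = indices[n_test + n_val:]        # used to fit the model
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1030)
print(len(train_idx), len(val_idx), len(test_idx))  # 618 206 206
```

In cross-validation the validation part is not fixed: it rotates through the non-test data, so every non-test sample is used for both fitting and validation across the rounds.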


In [1]:
import pandas as pd
import numpy as np

data = pd.read_excel("Concrete_Data.xls")
len(data)

1030

In [2]:
req_col_names = ["Cement", "BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer",
                 "CoarseAggregate", "FineAggregate", "Age", "CC_Strength"]
curr_col_names = list(data.columns)

# Map each original column name to its short name, by position
mapper = dict(zip(curr_col_names, req_col_names))

data = data.rename(columns=mapper)

In [3]:
data.head()

Unnamed: 0,Cement,BlastFurnaceSlag,FlyAsh,Water,Superplasticizer,CoarseAggregate,FineAggregate,Age,CC_Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05278
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075


In [4]:
X = data.iloc[:,:-1]         # Features - All columns but last
y = data.iloc[:,-1]          # Target - Last Column

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2021)

In [6]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [7]:
# Importing models
from sklearn.linear_model import LinearRegression
# Linear Regression
lr = LinearRegression()
# Fitting models on Training data
lr.fit(X_train, y_train)
# Making predictions on Test data
y_pred_lr = lr.predict(X_test)

In [8]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
print("Model\t\t\t RMSE \t\t MSE \t\t MAE \t\t R2")
print("""LinearRegression \t {:.2f} \t\t {:.2f} \t{:.2f} \t\t{:.6f}""".format(
            np.sqrt(mean_squared_error(y_test, y_pred_lr)),mean_squared_error(y_test, y_pred_lr),
            mean_absolute_error(y_test, y_pred_lr), r2_score(y_test, y_pred_lr)))

Model			 RMSE 		 MSE 		 MAE 		 R2
LinearRegression 	 11.17 		 124.81 	8.79 		0.548376


In [9]:
from sklearn.model_selection import ShuffleSplit, cross_val_score

# ShuffleSplit draws n_splits independent random 80/20 splits (test sets may overlap);
# change n_splits here to change the number of splits
cv = ShuffleSplit(n_splits=5, test_size=.2, random_state=0)

print(" Order \t\t\t  RMSE \t\t MSE\t\t MAE \t\t  R2")

for i, (train_index, test_index) in enumerate(cv.split(X, y), start=1):
    # Select the rows of this split by position
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Fit a fresh model on the training part and score it on the held-out part
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred_lr = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred_lr)
    print("{:2d} of {} folds\t\t{:.2f}\t\t{:.2f}\t\t{:.2f}\t\t{:.6f}".format(
        i,
        cv.n_splits,
        np.sqrt(mse),
        mse,
        mean_absolute_error(y_test, y_pred_lr),
        r2_score(y_test, y_pred_lr)
    ))

 Order 			  RMSE 		 MSE		 MAE 		  R2
 1 of 5 folds		9.78		95.64		7.87		0.636898
 2 of 5 folds		10.46		109.40		8.26		0.592934
 3 of 5 folds		11.20		125.40		8.87		0.610407
 4 of 5 folds		10.20		104.09		8.12		0.634528
 5 of 5 folds		9.92		98.31		7.96		0.651915


In [10]:
scores = cross_val_score(lr, X, y, cv=cv, scoring='r2')
print("Mean cross validation score: {:.6f}".format(scores.mean()))

Mean cross validation score: 0.625337
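`ShuffleSplit` and K-fold are different splitters. The sketch below uses 20 dummy samples (an assumption for illustration, not the concrete data) to show that `KFold` tests every sample exactly once, whereas the independent random splits of `ShuffleSplit` carry no such guarantee.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

X_demo = np.arange(20).reshape(-1, 1)  # 20 dummy samples (illustrative only)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# Collect the test indices produced by each splitter over all five rounds
kf_tested = np.concatenate([test for _, test in kf.split(X_demo)])
ss_tested = np.concatenate([test for _, test in ss.split(X_demo)])

# K-fold: every sample lands in exactly one test fold
print("KFold unique test samples:       ", len(np.unique(kf_tested)), "of", len(X_demo))
# ShuffleSplit: splits are drawn independently, so some samples may be
# tested several times and others never
print("ShuffleSplit unique test samples:", len(np.unique(ss_tested)), "of", len(X_demo))
```

Passing a `KFold` object as the `cv` argument of `cross_val_score` would give a K-fold estimate. Its scores would generally differ from the ShuffleSplit scores reported here.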


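**Summary**

- R2 score of linear regression on a single hold-out split (no cross-validation): 0.548376

- Mean R2 score over 5 random splits (ShuffleSplit, n_splits = 5): 0.625337

The cross-validated mean R2 is 14.03 % higher, in relative terms, than the single hold-out score. Both numbers evaluate the same linear regression model, so the gap shows how strongly a single split's score depends on which samples land in the test set. Cross-validation does not improve the model itself; it gives a more stable estimate of the model's performance by averaging over several splits.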
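The figures in this summary can be rechecked from the printed outputs. The sketch below copies the five per-split R2 values and the hold-out R2 from the cells above; the variable names are illustrative. It also reports the spread across splits, which the mean alone hides.

```python
from statistics import mean, stdev

# R2 values copied from the five splits printed in cell 9
fold_r2 = [0.636898, 0.592934, 0.610407, 0.634528, 0.651915]
holdout_r2 = 0.548376  # single train/test split from cell 8

cv_mean = mean(fold_r2)
cv_std = stdev(fold_r2)  # sample standard deviation across splits
rel_diff = (cv_mean - holdout_r2) / holdout_r2

print("CV R2: {:.6f} +/- {:.6f}".format(cv_mean, cv_std))
print("Relative difference to hold-out: {:.2%}".format(rel_diff))
```

Reporting the mean together with the standard deviation makes it easier to judge whether a difference between two scores is larger than the split-to-split variation.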
