# Supervised Machine Learning Applied to Stock Timing

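This notebook uses supervised learning to predict the direction of a stock's price.

First, load the stock data. Ping An Bank (000001.SZ) is used as the test stock and the Shanghai Composite Index (000001.SH) as the benchmark.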

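The prediction target is:

Given an environment (constructed from a number of attribute columns), predict whether the closing price 30 trading days later will be higher than the current closing price.

The attribute data includes: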

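    Adj close: adjusted closing price
    Daily volume: daily trading volume
    2-day net price change: percentage price change between two adjacent days
    10-day standard deviation: standard deviation of the price over 10 days
    10-day moving average: moving average of the price over 10 days
    50-day standard deviation: standard deviation of the price over 50 days
    50-day moving average: moving average of the price over 50 days
    10-day rolling beta against baseline: 10-day rolling beta against the benchmark
    50-day rolling beta against baseline: 50-day rolling beta against the benchmark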

The time range is 2015.01.01 - 2017.01.01, using the daily bar data within that range.


In [50]:
import tquotes.tquotes as tq
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale, StandardScaler
import tquotes.algos as algos
from datetime import datetime
import bisect

In [4]:
# Load data
startDate = '20150101'
endDate = '20170101'

stk = '000001'
indexCode = '000001'

dayData = tq.load_day_data_between(stk, startDate, endDate)
# Load the index daily bars, restricted to the same date range
dayIndexData = tq.load_day_index_between(indexCode, startDate, endDate)

In [5]:
print(dayData.head())
print(dayData.tail())

                  tm   open   high    low  close   volumn     in_money  \
2015-01-05  20150105  15.99  16.27  15.61  16.02  2860436  12323532.84   
2015-01-06  20150106  15.85  16.38  15.56  15.78  2166421   9615361.68   
2015-01-07  20150107  15.56  15.82  15.30  15.48  1700121   6881415.08   
2015-01-08  20150108  15.50  15.57  14.92  14.96  1407714   4504305.38   
2015-01-09  20150109  14.90  15.86  14.71  15.08  2508500  10868493.67   

              out_money       amount  prev_close  
2015-01-05  13713257.98  45652143.71       15.84  
2015-01-06  11596141.69  34534541.49       16.02  
2015-01-07   8349741.86  26350491.88       15.78  
2015-01-08   6885389.53  21283735.97       15.48  
2015-01-09  10977883.54  38353577.86       14.96  
                  tm  open  high   low  close  volumn   in_money   out_money  \
2016-12-27  20161227  9.12  9.13  9.07   9.08  268717  529887.07   740479.00   
2016-12-28  20161228  9.08  9.11  9.04   9.06  335963  475345.22   662408.89   
2016-12-2

In [6]:
dayIndexData.head()

Unnamed: 0,tm,open,high,low,close,volumn
2015-01-05,2015-01-05,3258.62,3369.01,3253.88,3350.51,54471306
2015-01-06,2015-01-06,3330.79,3393.86,3303.59,3351.44,52735197
2015-01-07,2015-01-07,3326.64,3374.75,3312.21,3373.95,43472524
2015-01-08,2015-01-08,3371.95,3380.44,3285.6,3293.45,39921659
2015-01-09,2015-01-09,3276.96,3404.42,3267.8,3285.41,45695625


In [9]:
print('stock data', dayData.shape)
print('index data', dayIndexData.shape)

stock data (489, 10)
index data (489, 6)


In [7]:
dayIndexData.tail()

Unnamed: 0,tm,open,high,low,close,volumn
2016-12-27,2016-12-27,3117.38,3127.71,3113.74,3114.66,16217960
2016-12-28,2016-12-28,3113.76,3118.78,3094.54,3102.23,15431549
2016-12-29,2016-12-29,3095.84,3111.79,3087.56,3096.09,14990488
2016-12-30,2016-12-30,3097.34,3108.8,3089.99,3103.63,15171833
2017-01-03,2017-01-03,3105.3,3136.45,3105.3,3135.92,15987288


In [34]:
def ols_data(y, x, window=10):
    """
    Rolling OLS: for each row, regress y on x over the preceding
    `window` rows and return the slope coefficient.
    """
    yArr = y.values
    xArr = x.values

    ratios = []
    for i in range(2, x.shape[0]):
        # Start of the rolling window, clamped at the first row
        starti = max(i - window, 0)
        x_piece = xArr[starti: i]
        y_piece = yArr[starti: i]

        lr = LinearRegression()
        lr.fit(x_piece.reshape(-1, 1), y_piece)
        ratios.append(lr.coef_[0])

    # The first two rows have too few points for a regression; pad with 0
    ratios.insert(0, 0)
    ratios.insert(0, 0)
    s = pd.Series(ratios, index=x.index)
    return s.fillna(-99)
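
As a cross-check on the loop above: the slope of a one-variable OLS regression with an intercept equals Cov(x, y) / Var(x), so a rolling beta can also be sketched with pandas rolling windows. This is only a sketch under assumptions. The helper name `rolling_beta` and the synthetic return series are illustrative and not part of the original notebook. Unlike `ols_data`, each window here includes the current row, rows with an incomplete window stay NaN, and both series are in the same units (plain returns rather than one in percent).

```python
import numpy as np
import pandas as pd


def rolling_beta(y, x, window=10):
    """Rolling OLS slope of y on x, computed as Cov(x, y) / Var(x).

    Hypothetical helper for illustration; windows with fewer than
    `window` observations yield NaN.
    """
    cov = x.rolling(window=window).cov(y)
    var = x.rolling(window=window).var()
    return cov / var


# Synthetic daily returns (made-up data): the stock is built with a beta of 1.5.
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-05", periods=200, freq="B")
index_ret = pd.Series(rng.normal(0, 0.01, size=200), index=idx)
noise = pd.Series(rng.normal(0, 0.002, size=200), index=idx)
stock_ret = 1.5 * index_ret + noise

beta_50 = rolling_beta(stock_ret, index_ret, window=50)
print(beta_50.dropna().describe())
```

Leaving incomplete windows as NaN and dropping them later avoids feeding a sentinel value into the regression.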

In [46]:
# Build the prediction attributes
# .copy() gives independent frames that are safe to add columns to
attrData = dayData[['tm', 'close', 'prev_close', 'amount']].copy()
attrIndexData = dayIndexData[['tm', 'close']].copy()
# Adj. close: the close column is already the adjusted price
# Daily volume: use the turnover amount instead, to avoid the effect of price adjustment

# 2-day price pct_change, in percent
attrData['pct_change'] = (attrData['close'] - attrData['prev_close']) / (attrData['prev_close']) * 100

# 10-day std
attrData['std_10'] = attrData['close'].rolling(window=10, min_periods=0).std()
# 50-day std
attrData['std_50'] = attrData['close'].rolling(window=50, min_periods=0).std()
# 10-day ma
attrData['ma_10'] = attrData['close'].rolling(window=10, min_periods=0).mean()
# 50-day ma
attrData['ma_50'] = attrData['close'].rolling(window=50, min_periods=0).mean()


# 10-day rolling beta against baseline: regress over 10 days of data and take the coefficient of x
# pandas no longer provides ols, so we implement it ourselves.

# ols_ret_10 = pd.ols(y=attrData['pct_change'], x=attrIndexData['close'].pct_change(), window=10, window_type='rolling')
# Caution: fillna(-99) puts a -99 sentinel into the first index return, and that
# value enters every regression window that contains the first row
ols_ret_10 = ols_data(attrData['pct_change'].fillna(-99), attrIndexData['close'].pct_change().fillna(-99), window=10)
attrData['beta_10'] = ols_ret_10
# 50-day rolling beta against baseline: regress over 50 days of data and take the coefficient of x
# ols_ret_50 = pd.ols(y=attrData['pct_change'], x=attrIndexData['close'].pct_change(), window=50, window_type='rolling')
ols_ret_50 = ols_data(attrData['pct_change'].fillna(-99), attrIndexData['close'].pct_change().fillna(-99), window=50)
attrData['beta_50'] = ols_ret_50

# Build the target: whether the close 30 days later is higher than the current close
# If so: the stock is in a bullish trend (bull), encoded as 1
# If not: the stock is in a bearish trend (bear), encoded as 0
# Note: for the last 30 rows shift(-30) is NaN and NaN > 0 is False, so they get 0
y = np.where((attrData['close'].shift(-30) - attrData['close']) > 0, 1, 0)

attrData['y'] = y

attrData = attrData.dropna(axis=0)
print(attrData.shape)

(488, 12)
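
One detail of the label construction is worth checking. For the last 30 rows the future close is unknown, yet the comparison still assigns them label 0, and `dropna` does not remove them because the label column contains no NaN. Below is a minimal sketch of one way to keep unknown outcomes out of the data set. The helper `make_labels` and the price series are made up for illustration and do not come from the original notebook.

```python
import numpy as np
import pandas as pd


def make_labels(close, horizon=30):
    """Return 1.0 if the close `horizon` rows ahead is higher, 0.0 if not,
    and NaN where the future close is not yet known."""
    future = close.shift(-horizon)
    labels = (future > close).astype(float)
    labels[future.isna()] = np.nan  # unknown future: no label
    return labels


# Made-up, strictly rising prices, so every known label is 1.
close = pd.Series(np.linspace(10.0, 20.0, 100))
labels = make_labels(close, horizon=30)

print(labels.value_counts(dropna=False))
labeled = labels.dropna()  # keep only rows whose outcome is known
print(len(labeled))
```

With labels built this way, a later `dropna` removes the unlabeled tail together with any rows that lack feature values.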


In [47]:
attrData.head()

Unnamed: 0,tm,close,prev_close,amount,pct_change,std_10,std_50,ma_10,ma_50,beta_10,beta_50,y
2015-01-06,20150106,15.78,16.02,34534541.49,-1.498127,0.169706,0.169706,15.9,15.9,0.0,0.0,0
2015-01-07,20150107,15.48,15.78,26350491.88,-1.901141,0.270555,0.270555,15.76,15.76,-0.026611,-0.026611,0
2015-01-08,20150108,14.96,15.48,21283735.97,-3.359173,0.456946,0.456946,15.56,15.56,-0.028646,-0.028646,0
2015-01-09,20150109,15.08,14.96,38353577.86,0.802139,0.4502,0.4502,15.464,15.464,-0.034232,-0.034232,0
2015-01-12,20150112,14.77,15.08,22932169.31,-2.055703,0.492358,0.492358,15.348333,15.348333,-0.026516,-0.026516,0


In [48]:
print(attrData['y'].value_counts())

0    292
1    196
Name: y, dtype: int64


In [57]:
# Build the training and test sets
X = attrData[['close', 'amount', 'pct_change', 'std_10', 'std_50', 'ma_10', 'ma_50', 'beta_10', 'beta_50']].values
norm_scaler = StandardScaler()
X_norm = norm_scaler.fit_transform(X)

y = attrData['y'].values

X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.1)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print('logreg score {:.3f}'.format(logreg.score(X_test, y_test)*100))

logreg score 69.388
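
Because the rows are time-ordered and each label looks 30 days ahead, a random split places test rows between training rows, and fitting `StandardScaler` on all rows lets test-set statistics influence training. A hedged sketch of a chronological alternative follows. The synthetic `X` and `y`, the 80/20 cut, and the 30-row gap between the two sets are assumptions chosen for illustration, not the original notebook's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (not the real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(488, 9))
y = (X[:, 0] + 0.5 * rng.normal(size=488) > 0).astype(int)

horizon = 30  # label look-ahead, in rows
split = int(len(X) * 0.8)

# Chronological split: earlier rows train, later rows test. The last
# `horizon` training rows are dropped so that no training label is
# computed from prices inside the test period.
X_train, y_train = X[:split - horizon], y[:split - horizon]
X_test, y_test = X[split:], y[split:]

# The pipeline fits the scaler on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print('chronological test accuracy: {:.3f}'.format(model.score(X_test, y_test)))
```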


## Prediction Conclusions

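Running the training and testing process above several times shows that the prediction accuracy varies considerably from run to run and is not stable.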
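
Some of this spread comes from the evaluation itself. `train_test_split` is called without a `random_state`, and a 10% test set of 488 rows holds only 49 samples, so every run scores a different small sample. One way to quantify the spread is to repeat the split with fixed seeds and report the mean and standard deviation of the accuracy. The sketch below uses synthetic data as a stand-in for the real features. `ShuffleSplit` with 50 repetitions is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(488, 9))
y = (X[:, 0] + rng.normal(size=488) > 0).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression())

# 50 random 90/10 splits, reproducible through random_state.
cv = ShuffleSplit(n_splits=50, test_size=0.1, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print('accuracy: mean {:.3f}, std {:.3f}, min {:.3f}, max {:.3f}'.format(
    scores.mean(), scores.std(), scores.min(), scores.max()))
```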

This may be a limitation of the logistic model; other model types could be tried to improve accuracy.
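
When comparing models, a baseline helps put the numbers in context: with the 292 to 196 label counts printed earlier, always predicting 0 already scores about 59.8% on the full data set. The sketch below compares candidates against such a baseline with time-ordered cross-validation. The synthetic data, the choice of `RandomForestClassifier` as the alternative model, and the `gap=30` setting (to mirror the 30-day label horizon) are all assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with a non-linear signal (not the real data).
rng = np.random.default_rng(1)
X = rng.normal(size=(488, 9))
y = (X[:, 0] * X[:, 1] + 0.3 * rng.normal(size=488) > 0).astype(int)

candidates = {
    'majority baseline': DummyClassifier(strategy='most_frequent'),
    'logistic regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'random forest': RandomForestClassifier(n_estimators=200, random_state=0),
}

# Each fold trains on earlier rows and tests on later ones, leaving
# 30 rows unused between the two sets.
cv = TimeSeriesSplit(n_splits=5, gap=30)
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv)
    results[name] = scores.mean()
    print('{:<20s} {:.3f}'.format(name, results[name]))
```

A model is only worth carrying forward if it beats the majority baseline consistently across folds.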

Next, we try to apply this to the backtesting process.
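
As a rough idea of what that step could look like, the sketch below turns a series of 0/1 predictions into a long-or-flat position and compares the resulting equity curve with buy-and-hold. Everything here is an assumption made for illustration: the prices and predictions are random, positions are taken one bar after the prediction, and transaction costs are ignored.

```python
import numpy as np
import pandas as pd

# Made-up closing prices and model predictions (1 = bullish, 0 = bearish).
rng = np.random.default_rng(7)
idx = pd.date_range('2016-01-04', periods=120, freq='B')
close = pd.Series(10 * np.exp(np.cumsum(rng.normal(0, 0.01, size=120))), index=idx)
pred = pd.Series(rng.integers(0, 2, size=120), index=idx)

daily_ret = close.pct_change().fillna(0.0)

# Trade on the next bar: today's prediction sets tomorrow's position,
# so the strategy never uses information from the bar it trades on.
position = pred.shift(1).fillna(0)
strategy_ret = position * daily_ret

equity = (1 + strategy_ret).cumprod()
buy_hold = (1 + daily_ret).cumprod()
print('strategy final equity: {:.4f}'.format(equity.iloc[-1]))
print('buy-and-hold final equity: {:.4f}'.format(buy_hold.iloc[-1]))
```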