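This research report, **"Alpha in Intraday Domain Information: High-Frequency Research Series No. 9"** (《日内分域信息中的Alpha—高频研究系列九》), was published by the 兴证金工 team. It studies how intraday domain information affects high-frequency factors and how such factors are built. The core idea is to partition intraday trading along three dimensions (time, price, and volume), build several Alpha factors from the partitions, and examine their stock-selection power and distinctiveness.

Key points:

1. **Building intraday domain information**: intraday data is partitioned along three dimensions (time, price, and trading activity), and these features are used to describe Alpha information. The basic factors are built from price and volume differences between the opening and closing periods of the day, and the report finds that they predict stock prices well.

2. **Salience factors**: drawing on salience theory and each stock's price and volume features, the report builds an own-salience factor and a "peer" salience factor. Both show strong predictive power, and the factor built from pairwise correlations between stocks is especially distinctive.

3. **Factor performance**: for both the basic factors and the salience factors, long-short portfolio tests give Sharpe ratios above 3, and some factors earn long-short returns close to 35%, which points to a marked ability to capture Alpha.

4. **Risk warning**: the model is based on historical data and may fail when the market environment changes. Investors should weigh carefully how it performs under different market conditions.

The report gives a detailed method for mining Alpha from intraday data using domain information, and is suited to developing high-frequency quantitative strategies.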

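To reproduce the report's framework, we split the work into a few key parts: data processing, factor construction, backtest analysis, and risk notes. Below is the full framework; each part corresponds to a module of the report.

1. Data preparation and preprocessing

First we process the minute-level high-frequency data that the factors will be built from.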

In [32]:
!pip install pyarrow



In [33]:
import pandas as pd
import os

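# Directory containing the Feather files
directory_path = '/Users/zhangrui/Desktop/励京资本/A股分钟'

# Collect every .feather file in the directory (sorted for a reproducible order)
file_paths = sorted(
    os.path.join(directory_path, file)
    for file in os.listdir(directory_path)
    if file.endswith('.feather')
)

# Read each file and concatenate everything into a single DataFrame
all_data = [pd.read_feather(file_path) for file_path in file_paths]
data = pd.concat(all_data, ignore_index=True)

# Preprocessing: treat missing volume as no trading, and forward-fill missing
# closes within each stock so one stock's price never leaks into another.
# Assumes the rows of each stock are already in time order.
data['volume'] = data['volume'].fillna(0)
data['close'] = data.groupby('stkcd')['close'].ffill()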



In [34]:
data

Unnamed: 0,date,stkcd,open,high,low,close,volume,money
0,2022-12-01 09:31:00,000004.XSHE,75.87,76.32,75.8,76.25,25947.0,1972620.0
1,2022-12-01 09:32:00,000004.XSHE,76.25,76.25,75.95,75.95,5721.0,435038.0
2,2022-12-01 09:33:00,000004.XSHE,76.1,76.1,75.87,76.02,7763.0,590132.0
3,2022-12-01 09:34:00,000004.XSHE,76.02,76.17,75.95,76.1,11107.0,844782.0
4,2022-12-01 09:35:00,000004.XSHE,76.32,76.54,76.17,76.54,15324.0,1170280.0
...,...,...,...,...,...,...,...,...
1008235,2024-08-22 14:56:00,000006.XSHE,199.33,199.84,199.33,199.33,2492.0,497302.0
1008236,2024-08-22 14:57:00,000006.XSHE,199.33,199.84,199.33,199.33,2266.0,452122.0
1008237,2024-08-22 14:58:00,000006.XSHE,199.84,199.84,199.84,199.84,112.0,22407.0
1008238,2024-08-22 14:59:00,000006.XSHE,199.84,199.84,199.84,199.84,0.0,0.0


In [35]:
# Prepare the domain partitions, starting with a split by time of day
# Convert 'date' to datetime
data['date'] = pd.to_datetime(data['date'])

# Label each bar as morning (before 12:00) or afternoon session.
# Vectorized, because a row-wise apply over about a million rows is slow.
data['session'] = data['date'].dt.hour.lt(12).map({True: 'morning', False: 'afternoon'})

In [36]:
data

Unnamed: 0,date,stkcd,open,high,low,close,volume,money,session
0,2022-12-01 09:31:00,000004.XSHE,75.87,76.32,75.8,76.25,25947.0,1972620.0,morning
1,2022-12-01 09:32:00,000004.XSHE,76.25,76.25,75.95,75.95,5721.0,435038.0,morning
2,2022-12-01 09:33:00,000004.XSHE,76.1,76.1,75.87,76.02,7763.0,590132.0,morning
3,2022-12-01 09:34:00,000004.XSHE,76.02,76.17,75.95,76.1,11107.0,844782.0,morning
4,2022-12-01 09:35:00,000004.XSHE,76.32,76.54,76.17,76.54,15324.0,1170280.0,morning
...,...,...,...,...,...,...,...,...,...
1008235,2024-08-22 14:56:00,000006.XSHE,199.33,199.84,199.33,199.33,2492.0,497302.0,afternoon
1008236,2024-08-22 14:57:00,000006.XSHE,199.33,199.84,199.33,199.33,2266.0,452122.0,afternoon
1008237,2024-08-22 14:58:00,000006.XSHE,199.84,199.84,199.84,199.84,112.0,22407.0,afternoon
1008238,2024-08-22 14:59:00,000006.XSHE,199.84,199.84,199.84,199.84,0.0,0.0,afternoon


In [37]:
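# Extract the hour from the 'date' column (kept for later inspection)
data['hour'] = data['date'].dt.hour

# Assign each bar to an hourly trading segment. Bars are bucketed by minutes
# since midnight into right-closed bins, so a bar stamped 10:30 belongs to
# '9:30-10:30'. This assumes each bar is stamped with the minute in which it
# ends. The '11:30-13:00' bin is the lunch break and should stay empty.
minute_of_day = data['date'].dt.hour * 60 + data['date'].dt.minute
segment_edges = [570, 630, 690, 780, 840, 900]  # 9:30, 10:30, 11:30, 13:00, 14:00, 15:00
segment_labels = ['9:30-10:30', '10:30-11:30', '11:30-13:00', '13:00-14:00', '14:00-15:00']
data['hour_segment'] = pd.cut(
    minute_of_day, bins=segment_edges, labels=segment_labels, include_lowest=True
).astype(str)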

In [38]:
data

Unnamed: 0,date,stkcd,open,high,low,close,volume,money,session,hour,hour_segment
0,2022-12-01 09:31:00,000004.XSHE,75.87,76.32,75.8,76.25,25947.0,1972620.0,morning,9,9:30-10:30
1,2022-12-01 09:32:00,000004.XSHE,76.25,76.25,75.95,75.95,5721.0,435038.0,morning,9,9:30-10:30
2,2022-12-01 09:33:00,000004.XSHE,76.1,76.1,75.87,76.02,7763.0,590132.0,morning,9,9:30-10:30
3,2022-12-01 09:34:00,000004.XSHE,76.02,76.17,75.95,76.1,11107.0,844782.0,morning,9,9:30-10:30
4,2022-12-01 09:35:00,000004.XSHE,76.32,76.54,76.17,76.54,15324.0,1170280.0,morning,9,9:30-10:30
...,...,...,...,...,...,...,...,...,...,...,...
1008235,2024-08-22 14:56:00,000006.XSHE,199.33,199.84,199.33,199.33,2492.0,497302.0,afternoon,14,14:00-15:00
1008236,2024-08-22 14:57:00,000006.XSHE,199.33,199.84,199.33,199.33,2266.0,452122.0,afternoon,14,14:00-15:00
1008237,2024-08-22 14:58:00,000006.XSHE,199.84,199.84,199.84,199.84,112.0,22407.0,afternoon,14,14:00-15:00
1008238,2024-08-22 14:59:00,000006.XSHE,199.84,199.84,199.84,199.84,0.0,0.0,afternoon,14,14:00-15:00


In [39]:
# Open-to-close price difference of a segment: last close minus first open.
# Relies on the rows of each group being in time order.
def calc_segment_price_diff(df):
    open_price = df.iloc[0]['open']
    close_price = df.iloc[-1]['close']
    return close_price - open_price

# Group by 'stkcd' and 'session' (or 'hour_segment') and compute the difference.
# Note: the trading day is not among the group keys, so each group pools every
# day in the sample and price_diff is one full-sample value per stock and
# session, not a per-day figure.
price_diff = data.groupby(['stkcd', 'session']).apply(calc_segment_price_diff)

# Convert the result to a DataFrame with a named 'price_diff' column
price_diff = price_diff.reset_index(name='price_diff')

# Merge the calculated 'price_diff' back into the original DataFrame
data = pd.merge(data, price_diff, on=['stkcd', 'session'], how='left')

In [40]:
data

Unnamed: 0,date,stkcd,open,high,low,close,volume,money,session,hour,hour_segment,price_diff
0,2022-12-01 09:31:00,000004.XSHE,75.87,76.32,75.8,76.25,25947.0,1972620.0,morning,9,9:30-10:30,3.21
1,2022-12-01 09:32:00,000004.XSHE,76.25,76.25,75.95,75.95,5721.0,435038.0,morning,9,9:30-10:30,3.21
2,2022-12-01 09:33:00,000004.XSHE,76.1,76.1,75.87,76.02,7763.0,590132.0,morning,9,9:30-10:30,3.21
3,2022-12-01 09:34:00,000004.XSHE,76.02,76.17,75.95,76.1,11107.0,844782.0,morning,9,9:30-10:30,3.21
4,2022-12-01 09:35:00,000004.XSHE,76.32,76.54,76.17,76.54,15324.0,1170280.0,morning,9,9:30-10:30,3.21
...,...,...,...,...,...,...,...,...,...,...,...,...
1008235,2024-08-22 14:56:00,000006.XSHE,199.33,199.84,199.33,199.33,2492.0,497302.0,afternoon,14,14:00-15:00,-125.52
1008236,2024-08-22 14:57:00,000006.XSHE,199.33,199.84,199.33,199.33,2266.0,452122.0,afternoon,14,14:00-15:00,-125.52
1008237,2024-08-22 14:58:00,000006.XSHE,199.84,199.84,199.84,199.84,112.0,22407.0,afternoon,14,14:00-15:00,-125.52
1008238,2024-08-22 14:59:00,000006.XSHE,199.84,199.84,199.84,199.84,0.0,0.0,afternoon,14,14:00-15:00,-125.52
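
The `price_diff` column above holds one value per stock and session for the whole sample. If a per-day figure is wanted instead, a minimal sketch could add the calendar day to the group keys. The helper name `session_price_diff` and the toy frame are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

def session_price_diff(df):
    """Last close minus first open per (stock, calendar day, session)."""
    df = df.sort_values(['stkcd', 'date'])          # time order within each stock
    df = df.assign(day=df['date'].dt.date)          # calendar day as an extra key
    agg = df.groupby(['stkcd', 'day', 'session']).agg(
        first_open=('open', 'first'),
        last_close=('close', 'last'),
    )
    agg['price_diff'] = agg['last_close'] - agg['first_open']
    return agg.reset_index()

# Hypothetical toy data: one stock, two mornings, two bars each
toy = pd.DataFrame({
    'stkcd': ['A'] * 4,
    'date': pd.to_datetime(['2022-12-01 09:31', '2022-12-01 11:30',
                            '2022-12-02 09:31', '2022-12-02 11:30']),
    'session': ['morning'] * 4,
    'open': [10.0, 10.5, 11.0, 11.2],
    'close': [10.4, 10.8, 11.1, 10.9],
})
daily_diff = session_price_diff(toy)
print(daily_diff)
```

Each morning now gets its own row, so a move on one day no longer hides inside a multi-year open-to-close difference.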


In [41]:
# Assuming 'data' is your DataFrame, save it to your desktop
save_path = '/Users/zhangrui/Desktop/data_modified001.csv'

# Save the DataFrame as a CSV file
data.to_csv(save_path, index=False)

In [111]:
merged_data

Unnamed: 0,date,stkcd,open,high,low,close,volume,money,session,hour,...,high_rank,low_rank,highRank_sum,lowRank_sum,high_dt,low_dt,diff_std,diff_vol,future_return,rank
0,2022-12-01,000001.XSHE,1657.91,1691.37,1656.67,1687.6500000000,160780.0,268431460.0000000000,morning,9,...,0.994792,0.978125,0.994792,0.978125,14,477,0.044682,0.0,0,801120.5
1,2022-12-01,000001.XSHE,1657.91,1691.37,1656.67,1687.6500000000,160780.0,268431460.0000000000,morning,9,...,0.994792,0.978125,1.989583,1.956250,14,477,0.044682,0.0,-0.0117500666607412674428939650,801120.5
2,2022-12-01,000001.XSHE,1685.17,1692.61,1666.59,1667.8200000000,65530.0,109918445.0000000000,morning,9,...,0.998958,0.996875,2.988542,2.953125,14,477,0.044682,0.0,0,801120.5
3,2022-12-01,000001.XSHE,1685.17,1692.61,1666.59,1667.8200000000,65530.0,109918445.0000000000,morning,9,...,0.998958,0.996875,3.987500,3.950000,14,477,0.044682,0.0,-0.0007374896571572471849480160,801120.5
4,2022-12-01,000001.XSHE,1667.82,1670.30,1666.59,1666.5900000000,42739.0,71314814.0000000000,morning,9,...,0.990625,0.996875,4.978125,4.946875,14,477,0.044682,0.0,0,801120.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1008235,2024-08-22,000010.XSHE,22.47,22.47,22.33,22.33,16979.0,379664.0,afternoon,14,...,0.120833,0.058333,1.812500,1.029167,33,231,0.029615,0.0,0.0,608760.5
1008236,2024-08-22,000010.XSHE,22.47,22.47,22.33,22.33,8756.0,196265.0,afternoon,14,...,0.120833,0.058333,1.812500,1.029167,33,231,0.029615,0.0,0.0,608760.5
1008237,2024-08-22,000010.XSHE,22.33,22.33,22.33,22.33,0.0,0.0,afternoon,14,...,0.008333,0.058333,1.700000,1.029167,33,231,0.029615,0.0,0.0,608760.5
1008238,2024-08-22,000010.XSHE,22.33,22.33,22.33,22.33,0.0,0.0,afternoon,14,...,0.008333,0.058333,1.587500,1.029167,33,231,0.029615,0.0,0.00627,608760.5


In [112]:
# `columns` is an attribute, not a method; calling it as merged_data.columns()
# raises TypeError: 'Index' object is not callable
print(merged_data.columns)

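2. Building the basic intraday factors

In this part we build the basic factors from features such as the price and volume differences between the open and the close of the day.

We need three basic factors, each comparing the first half hour after the open with the last half hour before the close:

	1.	lh_rtnDiff: ratio of returns
	2.	lh_volDiff: ratio of total trading volume
	3.	lh_stdDiff: ratio of volatility

The steps below implement these factor definitions in code:

1. Code for building the factors

First, we mark the first half hour after the open and the last half hour before the close within each trading day.

Steps:

	1.	First half hour after the open: 9:30 to 10:00
	2.	Last half hour before the close: 14:30 to 15:00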

In [52]:
import numpy as np
import pandas as pd

# Make sure the 'date' column is datetime, truncated to whole minutes
data['date'] = pd.to_datetime(data['date']).dt.floor('min')

# Label each bar as opening half hour (9:30-10:00), closing half hour
# (14:30-15:00) or 'other', using minutes since midnight
minute_of_day = data['date'].dt.hour * 60 + data['date'].dt.minute
data['time_period'] = 'other'
data.loc[minute_of_day.between(9 * 60 + 30, 10 * 60), 'time_period'] = 'morning_half_hour'
data.loc[minute_of_day.between(14 * 60 + 30, 15 * 60), 'time_period'] = 'afternoon_half_hour'

def aggregate_period(period):
    """Aggregate one labelled window to one row per stock and calendar day.

    Grouping by the calendar day (not the full timestamp) is what lets the
    opening and closing windows share keys in the merge below. 'first' and
    'last' rely on the rows of each stock-day being in time order.
    """
    window = data[data['time_period'] == period]
    return window.groupby(['stkcd', window['date'].dt.date]).agg({
        'open': 'first',   # first open of the window
        'close': 'last',   # last close of the window
        'volume': 'sum',   # total volume
        'high': 'max',     # highest price
        'low': 'min',      # lowest price
    }).reset_index()

morning_data = aggregate_period('morning_half_hour')
afternoon_data = aggregate_period('afternoon_half_hour')

# Pair each stock-day's opening window with its closing window
merged_data = pd.merge(morning_data, afternoon_data, on=['stkcd', 'date'], suffixes=('_morning', '_afternoon'))

def safe_ratio(num, den):
    """Element-wise num / den, with NaN where den is zero.

    np.divide(..., where=mask) without `out=` leaves the masked entries
    uninitialized, so the nonzero rows are divided explicitly instead.
    """
    nonzero = den != 0
    return (num[nonzero] / den[nonzero]).reindex(num.index)

# 1. lh_rtnDiff: closing-window return / opening-window return
merged_data['rtn_morning'] = (merged_data['close_morning'] - merged_data['open_morning']) / merged_data['open_morning']
merged_data['rtn_afternoon'] = (merged_data['close_afternoon'] - merged_data['open_afternoon']) / merged_data['open_afternoon']
merged_data['lh_rtnDiff'] = safe_ratio(merged_data['rtn_afternoon'], merged_data['rtn_morning'])

# 2. lh_volDiff: closing-window volume / opening-window volume
merged_data['lh_volDiff'] = safe_ratio(merged_data['volume_afternoon'], merged_data['volume_morning'])

# 3. lh_stdDiff: closing-window volatility / opening-window volatility,
#    with volatility proxied by the window's (high - low) / open range
merged_data['std_morning'] = (merged_data['high_morning'] - merged_data['low_morning']) / merged_data['open_morning']
merged_data['std_afternoon'] = (merged_data['high_afternoon'] - merged_data['low_afternoon']) / merged_data['open_afternoon']
merged_data['lh_stdDiff'] = safe_ratio(merged_data['std_afternoon'], merged_data['std_morning'])

# Keep the factor columns and drop rows that contain NaN
factor_data = merged_data[['stkcd', 'date', 'lh_rtnDiff', 'lh_volDiff', 'lh_stdDiff']].dropna()

# Show the first few rows
print(factor_data.head())

         stkcd        date                       lh_rtnDiff  \
0  000001.XSHE  2022-12-01   0.2260276338195973465050902702   
1  000001.XSHE  2022-12-02  0.07802675724652283262577831569   
2  000001.XSHE  2022-12-05                            0E+30   
3  000001.XSHE  2022-12-06   0.4281048803387075540684722467   
4  000001.XSHE  2022-12-07  -0.3176339952421322670229882376   

                       lh_volDiff                       lh_stdDiff  
0  0.2331566256245484503272496700   0.1536390853747409429132968358  
1  0.1495969549662004764458196714  0.08825247980920375088080654777  
2  0.5374809320472034614359515263   0.2105431795858279604178798930  
3  0.2998128677179968576477686543   0.2576206359560364042181956882  
4  0.7028859121652453430950039040   0.3560220018840977099504458495  


In [51]:
data

Unnamed: 0,date,stkcd,open,high,low,close,volume,money,session,hour,hour_segment,price_diff,time_period
0,2022-12-01 09:31:00,000004.XSHE,75.87,76.32,75.8,76.25,25947.0,1972620.0,morning,9,9:30-10:30,3.21,morning_half_hour
1,2022-12-01 09:32:00,000004.XSHE,76.25,76.25,75.95,75.95,5721.0,435038.0,morning,9,9:30-10:30,3.21,morning_half_hour
2,2022-12-01 09:33:00,000004.XSHE,76.1,76.1,75.87,76.02,7763.0,590132.0,morning,9,9:30-10:30,3.21,morning_half_hour
3,2022-12-01 09:34:00,000004.XSHE,76.02,76.17,75.95,76.1,11107.0,844782.0,morning,9,9:30-10:30,3.21,morning_half_hour
4,2022-12-01 09:35:00,000004.XSHE,76.32,76.54,76.17,76.54,15324.0,1170280.0,morning,9,9:30-10:30,3.21,morning_half_hour
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1008235,2024-08-22 14:56:00,000006.XSHE,199.33,199.84,199.33,199.33,2492.0,497302.0,afternoon,14,14:00-15:00,-125.52,afternoon_half_hour
1008236,2024-08-22 14:57:00,000006.XSHE,199.33,199.84,199.33,199.33,2266.0,452122.0,afternoon,14,14:00-15:00,-125.52,afternoon_half_hour
1008237,2024-08-22 14:58:00,000006.XSHE,199.84,199.84,199.84,199.84,112.0,22407.0,afternoon,14,14:00-15:00,-125.52,afternoon_half_hour
1008238,2024-08-22 14:59:00,000006.XSHE,199.84,199.84,199.84,199.84,0.0,0.0,afternoon,14,14:00-15:00,-125.52,afternoon_half_hour


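2. Explanation

	•	Time-period labels: each row is first labelled as belonging to the first half hour after the open, the last half hour before the close, or neither.
	•	Opening and closing aggregates: the bars in each of the two windows are aggregated to give every stock's open, close, high, low and volume for that window on each day.
	•	Factor calculation:
	•	lh_rtnDiff: closing-window return divided by opening-window return.
	•	lh_volDiff: closing-window volume divided by opening-window volume.
	•	lh_stdDiff: closing-window volatility divided by opening-window volatility.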
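
To make the three ratios concrete, here is a small worked sketch for a single stock-day. The helper names `window_stats` and `ratio` and the two-bar windows are hypothetical, and volatility is proxied by the (high - low) / open range as in the cell above; the report's exact definitions may differ.

```python
import numpy as np
import pandas as pd

def window_stats(bars):
    """Return, total volume and range-based volatility for one window of minute bars."""
    first_open = bars['open'].iloc[0]
    return pd.Series({
        'rtn': bars['close'].iloc[-1] / first_open - 1,
        'vol': bars['volume'].sum(),
        'std': (bars['high'].max() - bars['low'].min()) / first_open,
    })

def ratio(num, den):
    """num / den, or NaN when the denominator is zero."""
    return num / den if den != 0 else np.nan

# Hypothetical two-bar windows for one stock on one day
opening = pd.DataFrame({'open': [10.0, 10.2], 'high': [10.3, 10.5],
                        'low': [9.9, 10.1], 'close': [10.2, 10.4],
                        'volume': [1000.0, 3000.0]})
closing = pd.DataFrame({'open': [10.8, 10.9], 'high': [11.0, 11.0],
                        'low': [10.7, 10.8], 'close': [10.9, 10.908],
                        'volume': [500.0, 500.0]})

o, c = window_stats(opening), window_stats(closing)
factors = {
    'lh_rtnDiff': ratio(c['rtn'], o['rtn']),   # 1% closing return vs 4% opening return
    'lh_volDiff': ratio(c['vol'], o['vol']),   # 1000 shares vs 4000 shares
    'lh_stdDiff': ratio(c['std'], o['std']),
}
print(factors)
```

A flat opening window gives a zero denominator for `lh_rtnDiff`, which is why the ratio falls back to NaN rather than raising.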

3. Saving the results

Finally, save the computed factor data to a file:

In [43]:
# Save the results to the Mac desktop
factor_data.to_csv('/Users/zhangrui/Desktop/factor_data.csv', index=False)

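### First we set up the Rank IC test and the portfolio backtest, as follows:

1. Rank IC test

Rank IC is the rank correlation (usually the Spearman coefficient) between factor values and subsequent returns; the steps below compute it for each factor.

2. Long-short portfolio backtest

We sort stocks by factor value, form a long-short portfolio (long the stocks with high factor values, short those with low values), and compute its return, Sharpe ratio, annualized volatility, maximum drawdown and related metrics.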
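
Both tests need each factor value paired with a return realized after the factor is known. The sketch below is one way to build such a return from minute bars, assuming the day's close is the last minute close and the forward return is the next day's close-to-close move. The helper `next_day_returns`, the column name `future_return` and the toy data are hypothetical, and the report's own return definition is not given in this notebook.

```python
import pandas as pd

def next_day_returns(minute_bars):
    """Daily close per stock, then the close-to-close return of the following day."""
    bars = minute_bars.sort_values(['stkcd', 'date'])
    # Last minute close of each stock-day
    daily = (bars.groupby(['stkcd', bars['date'].dt.normalize()])['close']
                 .last()
                 .reset_index())
    # shift(-1) pulls tomorrow's close onto today's row, within each stock
    next_close = daily.groupby('stkcd')['close'].shift(-1)
    daily['future_return'] = next_close / daily['close'] - 1
    return daily

# Hypothetical toy data: one stock, three days, two bars per day
toy = pd.DataFrame({
    'stkcd': ['A'] * 6,
    'date': pd.to_datetime(['2022-12-01 14:59', '2022-12-01 15:00',
                            '2022-12-02 14:59', '2022-12-02 15:00',
                            '2022-12-05 14:59', '2022-12-05 15:00']),
    'close': [10.0, 10.0, 10.4, 10.5, 10.6, 10.29],
})
daily = next_day_returns(toy)
print(daily)
```

The last day of each stock has no following close, so its `future_return` is NaN and drops out of any later test.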

In [60]:
import pandas as pd

# Compute returns from the existing minute-level `data`
data['date'] = pd.to_datetime(data['date'])  # make sure 'date' is datetime
data = data.sort_values(['stkcd', 'date'])

# Bar-to-bar return per stock. `data` holds minute bars, so this is a
# one-minute return (the first bar of each day spans the overnight gap),
# not a daily return.
data['return'] = data.groupby('stkcd')['close'].pct_change()

# Make sure factor_data['date'] is datetime as well
factor_data['date'] = pd.to_datetime(factor_data['date'])

# Left-merge the returns onto the factor data, then drop rows with NaN.
# factor_data['date'] is a midnight timestamp while data['date'] carries the
# minute of each bar, so no (stkcd, date) key matches: every merged return is
# NaN and merged_data ends up empty. Running the Rank IC test and the
# long-short backtest on it gives NaN for every metric, which is why the
# following cells first reduce both 'date' columns to calendar dates.
merged_data = pd.merge(factor_data, data[['stkcd', 'date', 'return']], on=['stkcd', 'date'], how='left')
merged_data = merged_data.dropna()

In [64]:
import pandas as pd
import numpy as np
from scipy.stats import spearmanr

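# Reduce both 'date' columns to calendar dates so the merge keys line up.
# Note: this overwrites the minute timestamps in `data`, so the time of day
# can no longer be recovered from data['date'] afterwards.
data['date'] = pd.to_datetime(data['date']).dt.date
factor_data['date'] = pd.to_datetime(factor_data['date']).dt.date

# Inspect a sample of each 'date' column to confirm the formats agree
print("date sample from data:")
print(data['date'].head(10))
print("date sample from factor_data:")
print(factor_data['date'].head(10))

# Merge the factor data with the returns. Each daily factor row matches every
# minute bar of that stock-day, so the factor value repeats once per bar.
merged_data = pd.merge(factor_data, data[['stkcd', 'date', 'return']], on=['stkcd', 'date'], how='inner')

# Distribution of the merged data
print("Merged data distribution:", merged_data[['lh_rtnDiff', 'lh_volDiff', 'lh_stdDiff', 'return']].describe())

# Missing-value check
print("Merged data missing-value check:\n", merged_data.isna().sum())

# Sample of the merged data, to confirm 'date' and 'stkcd' are aligned
print("Merged data sample:")
print(merged_data[['stkcd', 'date', 'lh_rtnDiff', 'return']].head(10))

# Drop any rows that still contain NaN
merged_data = merged_data.dropna()

# Factor analysis and backtesting continue in the following cells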

date sample from data:
705600    2022-12-01
806400    2022-12-01
705601    2022-12-01
806401    2022-12-01
705602    2022-12-01
806402    2022-12-01
705603    2022-12-01
806403    2022-12-01
705604    2022-12-01
806404    2022-12-01
Name: date, dtype: object
date sample from factor_data:
0    2022-12-01
1    2022-12-02
2    2022-12-05
3    2022-12-06
4    2022-12-07
5    2022-12-08
6    2022-12-09
7    2022-12-12
8    2022-12-13
9    2022-12-14
Name: date, dtype: object
Merged data distribution:        lh_rtnDiff                      lh_volDiff lh_stdDiff  return
count      883920                          883920     883920  883913
unique       2499                            3289       3161   45716
top         0E+30  0.2331566256245484503272496700      0E+29       0
freq       195840                             480      23760  477441
Merged data missing-value check:
 stkcd         0
date          0
lh_rtnDiff    0
lh_volDiff    0
lh_stdDiff    0
return        7
dtype: int64
Merged data sample:
         stkcd        date     

In [65]:
# Drop rows whose `return` is NaN
merged_data = merged_data.dropna(subset=['return'])

# Check again for NaN values
print("Merged data missing-value check:\n", merged_data.isna().sum())

# Zero returns can be kept or dropped.
# To drop them (optional), uncomment:
# merged_data = merged_data[merged_data['return'] != 0]

# Show the data after cleaning
print("Merged data sample (after cleaning):")
print(merged_data[['stkcd', 'date', 'lh_rtnDiff', 'return']].head(10))

# Continue with the Rank IC test and the long-short backtest

Merged data missing-value check:
 stkcd         0
date          0
lh_rtnDiff    0
lh_volDiff    0
lh_stdDiff    0
return        0
dtype: int64
Merged data sample (after cleaning):
          stkcd        date                      lh_rtnDiff  \
1   000001.XSHE  2022-12-01  0.2260276338195973465050902702   
2   000001.XSHE  2022-12-01  0.2260276338195973465050902702   
3   000001.XSHE  2022-12-01  0.2260276338195973465050902702   
4   000001.XSHE  2022-12-01  0.2260276338195973465050902702   
5   000001.XSHE  2022-12-01  0.2260276338195973465050902702   
6   000001.XSHE  2022-12-01  0.2260276338195973465050902702   
7   000001.XSHE  2022-12-01  0.2260276338195973465050902702   
8   000001.XSHE  2022-12-01  0.2260276338195973465050902702   
9   000001.XSHE  2022-12-01  0.2260276338195973465050902702   
10  000001.XSHE  2022-12-01  0.2260276338195973465050902702   

                             return  
1                                 0  
2   -0.0117500666607412674428939650  
3                                 0  
4   

In [66]:
from scipy.stats import spearmanr
import pandas as pd
import numpy as np

# Rank IC test
def rank_ic(factor_col, return_col):
    """Mean and standard deviation of the per-date Rank IC.

    Rank IC is the Spearman correlation between the factor and the return
    column within each date. Here 'return' is the same-day minute-bar return
    merged above, not a forward return.
    """
    ic_values = []
    for date, group in merged_data.groupby('date'):
        # A correlation needs more than one data point
        if len(group) > 1:
            ic_value, _ = spearmanr(group[factor_col], group[return_col])
            ic_values.append(ic_value)
    # No usable dates: return NaN
    if len(ic_values) == 0:
        return np.nan, np.nan
    return np.mean(ic_values), np.std(ic_values)

# Compute IC and ICIR (IC mean / IC std) for each factor and print the results
for factor in ['lh_rtnDiff', 'lh_volDiff', 'lh_stdDiff']:
    ic_mean, ic_std = rank_ic(factor, 'return')
    icir = ic_mean / ic_std if ic_std != 0 else np.nan
    print(f"{factor} IC: mean={ic_mean}, std={ic_std}, ICIR={icir}")


lh_rtnDiff IC: mean=-1.2010783322195597e-05, std=0.016713078374647236, ICIR=-0.0007186457846338622
lh_volDiff IC: mean=0.0015685578480131735, std=0.015416371832078802, ICIR=0.10174623868044465
lh_stdDiff IC: mean=0.0016767165795684676, std=0.015500684773734875, ICIR=0.10817048433947762
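
For comparison, the sketch below computes Rank IC on a daily cross-section (one row per stock per date) against a forward return, and keeps the per-date IC values as a series. It uses the fact that the Spearman coefficient is the Pearson correlation of the ranks. The helper `rank_ic_series`, the column `future_return` and the toy panel are hypothetical.

```python
import numpy as np
import pandas as pd

def rank_ic_series(panel, factor_col, return_col):
    """One Spearman correlation per date, computed across stocks."""
    ics = {}
    for date, day in panel.groupby('date'):
        day = day[[factor_col, return_col]].dropna()
        # Skip dates with too few names or with a constant column
        if len(day) < 3 or day[factor_col].nunique() < 2 or day[return_col].nunique() < 2:
            continue
        # Spearman = Pearson correlation of the ranks
        ics[date] = day[factor_col].rank().corr(day[return_col].rank())
    return pd.Series(ics, dtype=float)

# Hypothetical panel: two dates, four stocks each
panel = pd.DataFrame({
    'date': ['d1'] * 4 + ['d2'] * 4,
    'stkcd': list('ABCD') * 2,
    'factor': [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    'future_return': [0.01, 0.02, 0.03, 0.04, 0.04, 0.03, 0.02, 0.01],
})
ic = rank_ic_series(panel, 'factor', 'future_return')
ic_mean, ic_std = ic.mean(), ic.std()
icir = ic_mean / ic_std if ic_std > 0 else np.nan
print(ic, ic_mean, icir)
```

On the toy panel the factor ranks returns perfectly on the first date and in reverse on the second, so the two IC values cancel and the ICIR is zero.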


In [69]:
import pandas as pd
import numpy as np

# Long-short portfolio backtest
def long_short_test(factor_col):
    """Long the top 30% of rows by factor value, short the bottom 30%, per date.

    Returns annualized return, annualized volatility, Sharpe ratio and the
    worst single-date long-short return. The last figure is the minimum of
    the per-date returns, not a maximum drawdown, which would have to be
    measured on the cumulative return curve.
    """
    results = []
    for date, group in merged_data.groupby('date'):
        # Size of each leg; dates with fewer than 4 rows give n == 0 and are skipped
        n = int(len(group) * 0.3)
        if n == 0:
            continue
        # Cast to float (the columns may hold Decimal values), then sort by factor
        group = group.astype({'return': float, factor_col: float}).sort_values(by=factor_col)
        long_return = group['return'].iloc[-n:].mean()   # highest 30% of factor values
        short_return = group['return'].iloc[:n].mean()   # lowest 30% of factor values
        results.append(long_return - short_return)

    # No usable dates: return NaN for every metric
    long_short = pd.Series(results, dtype=float).dropna()
    if long_short.empty:
        return np.nan, np.nan, np.nan, np.nan

    # Annualize assuming 252 dates per year. Each per-date figure averages
    # minute-bar returns, so these are not annualized daily returns.
    annual_return = long_short.mean() * 252
    annual_volatility = long_short.std() * np.sqrt(252)
    sharpe_ratio = annual_return / annual_volatility if annual_volatility != 0 else np.nan
    worst_return = long_short.min()

    return annual_return, annual_volatility, sharpe_ratio, worst_return

# Run the long-short test for each factor and print the results
for factor in ['lh_rtnDiff', 'lh_volDiff', 'lh_stdDiff']:
    annual_return, annual_vol, sharpe, worst_return = long_short_test(factor)
    print(f"{factor}: annualized return={annual_return}, annualized volatility={annual_vol}, Sharpe ratio={sharpe}, worst single-date return={worst_return}")

lh_rtnDiff: annualized return=-0.0017817508790120271, annualized volatility=0.0013687703174053978, Sharpe ratio=-1.3017164796424454, worst single-date return=-0.0003744639212557424
lh_volDiff: annualized return=-0.0017329076959887244, annualized volatility=0.0016190383263825815, Sharpe ratio=-1.0703314849010162, worst single-date return=-0.0003921990370828741
lh_stdDiff: annualized return=-0.0030163476082950965, annualized volatility=0.001660666464972306, Sharpe ratio=-1.8163476362758963, worst single-date return=-0.0004623263533109667
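
The last number printed for each factor is the smallest per-date long-short return. A maximum drawdown, by contrast, is measured on the compounded return curve. A minimal sketch, assuming simple per-period returns that compound; the helper name `max_drawdown` and the sample returns are hypothetical:

```python
import pandas as pd

def max_drawdown(period_returns):
    """Largest peak-to-trough fall of the compounded equity curve (a non-positive number)."""
    equity = (1 + pd.Series(period_returns, dtype=float)).cumprod()
    # Treat the starting capital of 1.0 as a possible peak
    running_peak = equity.cummax().clip(lower=1.0)
    return (equity / running_peak - 1).min()

returns = [0.10, -0.05, -0.10, 0.20]
mdd = max_drawdown(returns)
print(mdd)
```

Here the curve peaks at 1.1 and bottoms at 0.9405, a drawdown of 14.5%, which is deeper than the worst single period (10%) because two losing periods occur back to back.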


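1. Computing the intraday high and low price series:

	•	From the minute data, first build each day's minute-by-minute series of high and low prices.
	•	Then compute a percentile rank for the highs and for the lows, and accumulate the ranks over a 15-minute rolling window.

2. Locating the high and low time points:

	•	For each day's minute data, find the time with the highest cumulative high-price rank and the time with the lowest cumulative low-price rank.

3. Building the factors:

	•	Build difference factors from information such as volume and volatility at the high-price and low-price time points.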
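
The first two steps can be sketched on toy data as follows. This is one reading of the steps above, not the report's confirmed method: ranks and rolling sums are computed within each stock-day, only full windows are kept (`min_periods=window`, an assumption), and the result is a timestamp for each time point. The helper `high_low_times` and the toy bars are hypothetical.

```python
import pandas as pd

def high_low_times(bars, window=15):
    """Timestamps of the peak cumulative high rank and the trough cumulative low rank per stock-day."""
    bars = bars.sort_values(['stkcd', 'date']).reset_index(drop=True)
    bars['day'] = bars['date'].dt.normalize()
    keys = ['stkcd', 'day']

    # Percentile rank of each minute's high / low within its own stock-day
    bars['high_rank'] = bars.groupby(keys)['high'].rank(pct=True)
    bars['low_rank'] = bars.groupby(keys)['low'].rank(pct=True)

    # Rolling sum restarted every stock-day; partial windows are left as NaN
    def roll(s):
        return s.rolling(window, min_periods=window).sum()
    bars['highRank_sum'] = bars.groupby(keys)['high_rank'].transform(roll)
    bars['lowRank_sum'] = bars.groupby(keys)['low_rank'].transform(roll)

    # idxmax / idxmin return row labels, which map back to timestamps
    grouped = bars.groupby(keys)
    idx_high = grouped['highRank_sum'].idxmax()
    idx_low = grouped['lowRank_sum'].idxmin()
    out = pd.DataFrame({
        'high_dt': bars.loc[idx_high.to_numpy(), 'date'].to_numpy(),
        'low_dt': bars.loc[idx_low.to_numpy(), 'date'].to_numpy(),
    }, index=idx_high.index)
    return out.reset_index()

# Hypothetical toy data: one stock, five minute bars, 2-minute window
toy = pd.DataFrame({
    'stkcd': ['A'] * 5,
    'date': pd.date_range('2022-12-01 09:31', periods=5, freq='min'),
    'high': [10.0, 11.0, 13.0, 12.0, 9.0],
    'low': [8.0, 9.0, 12.0, 11.0, 8.5],
})
times = high_low_times(toy, window=2)
print(times)
```

Requiring full windows keeps the low time point from defaulting to the first bar of the day, where a one-bar window would otherwise give the smallest sum. A stock-day shorter than the window would have no valid sum at all and needs separate handling.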

In [74]:
import pandas as pd
import numpy as np

# `data` is assumed to hold intraday minute bars with 'open', 'high', 'low', 'close', 'volume', etc. for each stock.
# Make sure 'date' is datetime and carries the minute of each bar
data['date'] = pd.to_datetime(data['date'])

# Cast the numeric columns to float so Decimal and float values are never mixed
for col in ['open', 'high', 'low', 'volume']:
    data[col] = data[col].astype(float)

In [82]:
# Group keys: one group per stock and calendar day
day_keys = ['stkcd', data['date'].dt.date]

# Step 1: percentile rank of each minute's high and low within its stock-day
data['high_rank'] = data.groupby(day_keys)['high'].rank(pct=True)
data['low_rank'] = data.groupby(day_keys)['low'].rank(pct=True)

# Step 2: 15-minute rolling sum of the ranks, restarted for every stock-day so
# that a window never mixes bars from two different days
def rolling_rank_sum(s):
    return s.rolling(window=15, min_periods=1).sum()

data['highRank_sum'] = data.groupby(day_keys)['high_rank'].transform(rolling_rank_sum)
data['lowRank_sum'] = data.groupby(day_keys)['low_rank'].transform(rolling_rank_sum)

# Step 3: locate the high and low points of each stock-day. The stored value is
# the 0-based row position within the stock-day of the largest cumulative high
# rank and of the smallest cumulative low rank, not a timestamp.
data['high_dt'] = data.groupby(day_keys)['highRank_sum'].transform(lambda s: s.to_numpy().argmax())
data['low_dt'] = data.groupby(day_keys)['lowRank_sum'].transform(lambda s: s.to_numpy().argmin())

In [88]:
print(data.columns)

Index(['date', 'stkcd', 'open', 'high', 'low', 'close', 'volume', 'money',
       'session', 'hour', 'hour_segment', 'price_diff', 'time_period',
       'return', 'high_rank', 'low_rank', 'highRank_sum', 'lowRank_sum',
       'high_dt', 'low_dt'],
      dtype='object')


In [91]:
import pandas as pd
import numpy as np

# Assumes the working DataFrame is `data`
data['date'] = pd.to_datetime(data['date'])  # make sure 'date' is datetime
for col in ['open', 'high', 'low', 'volume']:
    data[col] = data[col].astype(float)      # cast numeric columns to float

# Volatility and volume dispersion of one stock-day
def calc_diff_std_vol(group):
    # Range-based volatility: (day high - day low) / mean of the minute opens
    diff_std = (group['high'].max() - group['low'].min()) / group['open'].mean()

    # Volume dispersion: max minute volume / min minute volume,
    # NaN when the quietest minute has zero volume (avoids dividing by zero)
    vol_min = group['volume'].min()
    if vol_min == 0:
        diff_vol = np.nan
    else:
        diff_vol = group['volume'].max() / vol_min

    return pd.Series({'diff_std': diff_std, 'diff_vol': diff_vol})

# One row per stock and calendar day
grouped_data = data.groupby(['stkcd', data['date'].dt.date]).apply(calc_diff_std_vol).reset_index()

# Convert the grouped 'date' back to datetime so it has the same type as data['date']
grouped_data['date'] = pd.to_datetime(grouped_data['date'])

# Merge the per-day results back onto the minute-level rows. The keys match
# only because data['date'] was reduced to calendar dates earlier.
merged_data = pd.merge(data, grouped_data, on=['stkcd', 'date'], how='left')

# Check the result
print(merged_data[['stkcd', 'date', 'diff_std', 'diff_vol']].head())

         stkcd       date  diff_std  diff_vol
0  000001.XSHE 2022-12-01  0.044682       NaN
1  000001.XSHE 2022-12-01  0.044682       NaN
2  000001.XSHE 2022-12-01  0.044682       NaN
3  000001.XSHE 2022-12-01  0.044682       NaN
4  000001.XSHE 2022-12-01  0.044682       NaN


In [92]:
# Check which stock-day groups have a minimum volume of 0
volume_min_check = data.groupby(['stkcd', data['date'].dt.date])['volume'].min()
print(volume_min_check[volume_min_check == 0])

stkcd        date      
000001.XSHE  2022-12-01    0.0
             2022-12-02    0.0
             2022-12-05    0.0
             2022-12-06    0.0
             2022-12-07    0.0
                          ... 
000010.XSHE  2024-08-16    0.0
             2024-08-19    0.0
             2024-08-20    0.0
             2024-08-21    0.0
             2024-08-22    0.0
Name: volume, Length: 3781, dtype: float64


In [93]:
import pandas as pd
import numpy as np

# Assume the DataFrame is named 'data'
data['date'] = pd.to_datetime(data['date'])  # make sure 'date' is a datetime column
data['open'] = data['open'].astype(float)    # convert numeric columns to float
data['high'] = data['high'].astype(float)
data['low'] = data['low'].astype(float)
data['volume'] = data['volume'].astype(float)

# Compute the volatility difference and the volume difference for one group
def calc_diff_std_vol(group):
    # Volatility difference: (highest high - lowest low) / mean open
    diff_std = (group['high'].max() - group['low'].min()) / group['open'].mean()
    
    # Volume difference: max volume / min volume, guarding against division by zero
    vol_min = group['volume'].min()
    if vol_min == 0:
        # Fall back to 0 when the minimum volume is 0.
        # Note: every group with a zero-volume minute (see the check above) gets the same value 0
        diff_vol = 0
    else:
        diff_vol = group['volume'].max() / vol_min
    
    return pd.Series({'diff_std': diff_std, 'diff_vol': diff_vol})

# Group by stock code and trading day, then compute both differences per group
grouped_data = data.groupby(['stkcd', data['date'].dt.date]).apply(calc_diff_std_vol).reset_index()

# Make the 'date' key the same type on both sides of the merge
grouped_data['date'] = pd.to_datetime(grouped_data['date'])  # convert 'date' in grouped_data to datetime

# Merge the per-group results back into the original data
merged_data = pd.merge(data, grouped_data, on=['stkcd', 'date'], how='left')

# Inspect the result
print(merged_data[['stkcd', 'date', 'diff_std', 'diff_vol']].head())

         stkcd       date  diff_std  diff_vol
0  000001.XSHE 2022-12-01  0.044682       0.0
1  000001.XSHE 2022-12-01  0.044682       0.0
2  000001.XSHE 2022-12-01  0.044682       0.0
3  000001.XSHE 2022-12-01  0.044682       0.0
4  000001.XSHE 2022-12-01  0.044682       0.0
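Since the zero-volume check flags thousands of stock-days, the max/min volume ratio is either undefined (NaN) or forced to a constant 0 for those groups, which leaves little cross-sectional variation to rank on. One possible workaround, sketched below on toy data, is to compute the ratio over bars that actually traded. This is an assumption of this sketch rather than the report's definition, and `volume_ratio_nonzero` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def volume_ratio_nonzero(volume: pd.Series) -> float:
    """Max/min volume ratio over bars with non-zero volume (hypothetical helper)."""
    traded = volume[volume > 0]
    if traded.empty:
        return np.nan  # no trading at all in this group, ratio undefined
    return float(traded.max() / traded.min())

# Toy data: two stock-days of minute volumes, each containing zero-volume bars
toy = pd.DataFrame({
    'stkcd': ['A'] * 4 + ['B'] * 4,
    'day': ['2022-12-01'] * 8,
    'volume': [100.0, 0.0, 400.0, 200.0, 0.0, 0.0, 50.0, 150.0],
})

# One ratio per (stock, day); zero-volume bars no longer collapse the factor
diff_vol = toy.groupby(['stkcd', 'day'])['volume'].apply(volume_ratio_nonzero)
print(diff_vol)
```

On the real data the same function could replace the `vol_min == 0` branch inside `calc_diff_std_vol`; whether dropping zero-volume bars matches the report's intent would need to be checked against the report itself.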


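Next, we evaluate and test the basic factors built from intraday price-domain features. This takes the following steps:

	1.	Factor IC calculation:
	•	Compute each factor's IC mean, IC standard deviation, and ICIR.
	•	The IC (Information Coefficient) measures how well a factor predicts stock returns; it is usually computed as the correlation between factor values and subsequent returns.
	2.	Long-short portfolio test:
	•	Run a long-short portfolio test for each factor and compute the long-short return, annualized volatility, Sharpe ratio, maximum drawdown, and related metrics.

Implementation steps:

Step 1: IC calculation

We can use the Spearman rank correlation to compute each factor's IC.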

In [101]:
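# 'close' is the closing price of each bar.
# Note: on minute bars, shift(-1) gives the return to the next row of the same stock, not a next-day return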
data['future_return'] = data.groupby('stkcd')['close'].shift(-1) / data['close'] - 1

In [103]:
data

Unnamed: 0,date,stkcd,open,high,low,close,volume,money,session,hour,...,price_diff,time_period,return,high_rank,low_rank,highRank_sum,lowRank_sum,high_dt,low_dt,future_return
705600,2022-12-01,000001.XSHE,1657.91,1691.37,1656.67,1687.6500000000,160780.0,268431460.0000000000,morning,9,...,-232.6800000000,morning_half_hour,,0.994792,0.978125,0.994792,0.978125,14,477,0
806400,2022-12-01,000001.XSHE,1657.91,1691.37,1656.67,1687.6500000000,160780.0,268431460.0000000000,morning,9,...,-232.6800000000,morning_half_hour,0,0.994792,0.978125,1.989583,1.956250,14,477,-0.0117500666607412674428939650
705601,2022-12-01,000001.XSHE,1685.17,1692.61,1666.59,1667.8200000000,65530.0,109918445.0000000000,morning,9,...,-232.6800000000,morning_half_hour,-0.0117500666607412674428939650,0.998958,0.996875,2.988542,2.953125,14,477,0
806401,2022-12-01,000001.XSHE,1685.17,1692.61,1666.59,1667.8200000000,65530.0,109918445.0000000000,morning,9,...,-232.6800000000,morning_half_hour,0,0.998958,0.996875,3.987500,3.950000,14,477,-0.0007374896571572471849480160
705602,2022-12-01,000001.XSHE,1667.82,1670.30,1666.59,1666.5900000000,42739.0,71314814.0000000000,morning,9,...,-232.6800000000,morning_half_hour,-0.0007374896571572471849480160,0.990625,0.996875,4.978125,4.946875,14,477,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
403195,2024-08-22,000010.XSHE,22.47,22.47,22.33,22.33,16979.0,379664.0,afternoon,14,...,-21.93,afternoon_half_hour,-0.006231,0.120833,0.058333,1.812500,1.029167,33,231,0.0
403196,2024-08-22,000010.XSHE,22.47,22.47,22.33,22.33,8756.0,196265.0,afternoon,14,...,-21.93,afternoon_half_hour,0.0,0.120833,0.058333,1.812500,1.029167,33,231,0.0
403197,2024-08-22,000010.XSHE,22.33,22.33,22.33,22.33,0.0,0.0,afternoon,14,...,-21.93,afternoon_half_hour,0.0,0.008333,0.058333,1.700000,1.029167,33,231,0.0
403198,2024-08-22,000010.XSHE,22.33,22.33,22.33,22.33,0.0,0.0,afternoon,14,...,-21.93,afternoon_half_hour,0.0,0.008333,0.058333,1.587500,1.029167,33,231,0.00627


In [104]:
# Optionally save the result to a file, for example on the Mac desktop:
data.to_csv('/Users/zhangrui/Desktop/factor_data.csv', index=False)

In [107]:
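# Build the forward return from closing prices
# (shift(-1) moves to the next row of the same stock, i.e. the next bar on minute data)
merged_data['future_return'] = merged_data.groupby('stkcd')['close'].shift(-1) / merged_data['close'] - 1

# Check that the 'future_return' column was generated
print(merged_data[['stkcd', 'date', 'close', 'future_return']].head())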

         stkcd       date            close                    future_return
0  000001.XSHE 2022-12-01  1687.6500000000                                0
1  000001.XSHE 2022-12-01  1687.6500000000  -0.0117500666607412674428939650
2  000001.XSHE 2022-12-01  1667.8200000000                                0
3  000001.XSHE 2022-12-01  1667.8200000000  -0.0007374896571572471849480160
4  000001.XSHE 2022-12-01  1666.5900000000                                0
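Because `shift(-1)` is applied to minute bars, `future_return` above is the return to the next row of the same stock. If a next-day return is what the IC test needs, one option is to collapse the bars to one close per stock and trading day first. The sketch below does this on made-up prices; the column name `daily_close` is an assumption of this example.

```python
import pandas as pd

# Toy minute bars for one stock over three trading days (prices are made up)
bars = pd.DataFrame({
    'stkcd': ['A'] * 6,
    'date': pd.to_datetime([
        '2022-12-01 14:59', '2022-12-01 15:00',
        '2022-12-02 14:59', '2022-12-02 15:00',
        '2022-12-05 14:59', '2022-12-05 15:00',
    ]),
    'close': [10.0, 10.0, 10.5, 11.0, 11.0, 12.1],
})

# One row per stock and trading day, keeping the last close of the day
bars = bars.sort_values(['stkcd', 'date'])
daily = (bars.groupby(['stkcd', bars['date'].dt.normalize()])['close']
             .last()
             .rename('daily_close')
             .reset_index())

# Next-day return: tomorrow's close over today's close, within each stock
daily['future_return'] = (daily.groupby('stkcd')['daily_close'].shift(-1)
                          / daily['daily_close'] - 1)
print(daily)
```

The last trading day of each stock has no next close, so its `future_return` is NaN and should be dropped before computing correlations.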


In [108]:
from scipy.stats import spearmanr

# Define the IC function
def calc_ic(merged_data, factor_name):
    # IC as the Spearman rank correlation between the factor and the forward return
    ic, _ = spearmanr(merged_data[factor_name], merged_data['future_return'])
    return ic

# Compute the IC of diff_std and diff_vol
ic_diff_std = calc_ic(merged_data, 'diff_std')
ic_diff_vol = calc_ic(merged_data, 'diff_vol')

# Print the results
print(f"diff_std IC: {ic_diff_std}")
print(f"diff_vol IC: {ic_diff_vol}")

diff_std IC: nan
diff_vol IC: nan




In [109]:
# Check whether diff_std and diff_vol are constant or contain NaN values
print("diff_std distribution:")
print(merged_data['diff_std'].describe())

print("diff_vol distribution:")
print(merged_data['diff_vol'].describe())

# Count NaN values
print("diff_std NaN count:", merged_data['diff_std'].isna().sum())
print("diff_vol NaN count:", merged_data['diff_vol'].isna().sum())

diff_std distribution:
count    988800.000000
mean          0.031622
std           0.023306
min           0.000000
25%           0.016854
50%           0.024617
75%           0.038806
max           0.206526
Name: diff_std, dtype: float64
diff_vol distribution:
count    1008240.0
mean           0.0
std            0.0
min            0.0
25%            0.0
50%            0.0
75%            0.0
max            0.0
Name: diff_vol, dtype: float64
diff_std NaN count: 19440
diff_vol NaN count: 0
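The diagnostics point to two likely reasons for the `nan` IC values: `diff_std` contains NaN rows, and `diff_vol` is constant, so its rank correlation is undefined. A minimal sketch of an IC function that handles both cases follows. `calc_ic_safe` is a hypothetical name, and the sketch assumes the factor and return columns are plain floats.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def calc_ic_safe(df: pd.DataFrame, factor_name: str,
                 ret_name: str = 'future_return') -> float:
    """Spearman IC over rows where factor and return are both present (hypothetical helper)."""
    pair = df[[factor_name, ret_name]].dropna()
    # A rank correlation is undefined with too few rows or a constant input
    if len(pair) < 3 or pair[factor_name].nunique() < 2 or pair[ret_name].nunique() < 2:
        return np.nan
    ic, _ = spearmanr(pair[factor_name], pair[ret_name])
    return float(ic)

# Toy data mirroring the two problems seen above
toy = pd.DataFrame({
    'diff_std': [0.01, 0.02, np.nan, 0.04, 0.05],   # contains a NaN
    'diff_vol': [0.0, 0.0, 0.0, 0.0, 0.0],          # constant factor
    'future_return': [0.001, 0.002, 0.003, np.nan, 0.005],
})
print(calc_ic_safe(toy, 'diff_std'))
print(calc_ic_safe(toy, 'diff_vol'))
```

Returning NaN for a constant factor is deliberate: it marks the factor as carrying no ranking information rather than silently reporting a zero IC.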


In [110]:
# Compute the IC of diff_std and diff_vol
ic_diff_std = calc_ic(merged_data, 'diff_std')
ic_diff_vol = calc_ic(merged_data, 'diff_vol')

# Print the results
print(f"diff_std IC: {ic_diff_std}")
print(f"diff_vol IC: {ic_diff_vol}")

# Compute the IC statistics of diff_std and diff_vol
# (calc_ic_stats and long_short_strategy are defined in cells In [97] and In [98], shown further below)
ic_mean_std, ic_std_std, ic_ir_std = calc_ic_stats(merged_data, 'diff_std')
ic_mean_vol, ic_std_vol, ic_ir_vol = calc_ic_stats(merged_data, 'diff_vol')

# Print the statistics
print(f"diff_std IC mean: {ic_mean_std}, IC std: {ic_std_std}, ICIR: {ic_ir_std}")
print(f"diff_vol IC mean: {ic_mean_vol}, IC std: {ic_std_vol}, ICIR: {ic_ir_vol}")

# Compute the long-short return and volatility of diff_std and diff_vol
annual_return_std, vol_std = long_short_strategy(merged_data, 'diff_std')
annual_return_vol, vol_vol = long_short_strategy(merged_data, 'diff_vol')

# Print the results
print(f"diff_std annualized return: {annual_return_std}, annualized volatility: {vol_std}")
print(f"diff_vol annualized return: {annual_return_vol}, annualized volatility: {vol_vol}")



diff_std IC: nan
diff_vol IC: nan
diff_std IC mean: -0.007524932532888245, IC std: 0.01791000367472435, ICIR: -0.4201524840281229
diff_vol IC mean: nan, IC std: nan, ICIR: nan


TypeError: unsupported operand type(s) for +: 'decimal.Decimal' and 'float'
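The traceback is truncated, so the exact failing line is not shown, but the message suggests that some price-derived column holds `decimal.Decimal` objects. That would be consistent with `close` and `future_return` printing with long decimal expansions, and with the earlier cells converting `open`, `high`, `low` and `volume` to float but not `close`. A sketch of one common fix, assuming float precision is acceptable for this analysis, is to convert the price columns before any arithmetic:

```python
from decimal import Decimal
import pandas as pd

# Toy frame whose 'close' column holds Decimal objects (object dtype)
toy = pd.DataFrame({
    'stkcd': ['A', 'A', 'A'],
    'close': [Decimal('1687.65'), Decimal('1667.82'), Decimal('1666.59')],
})
print(toy['close'].dtype)  # object

# Convert every price column that may hold Decimal values to float
price_cols = ['close']  # assumption: extend with other Decimal columns as needed
for col in price_cols:
    toy[col] = toy[col].astype(float)

# Arithmetic now stays in float64 and mixes safely with NumPy functions
toy['future_return'] = toy.groupby('stkcd')['close'].shift(-1) / toy['close'] - 1
print(toy.dtypes)
```

Applied to the notebook, the conversion would go next to the existing `astype(float)` lines, before `future_return` is computed.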

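Step 2: IC statistics

To obtain the mean and standard deviation of the IC, apply the IC calculation above to each period in the sample, then summarize the resulting series of per-period ICs.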

In [97]:
def calc_ic_stats(merged_data, factor_name):
    ic_values = merged_data.groupby('date').apply(lambda x: calc_ic(x, factor_name))
    ic_mean = ic_values.mean()
    ic_std = ic_values.std()
    ic_ir = ic_mean / ic_std  # 计算ICIR
    return ic_mean, ic_std, ic_ir

# Compute the IC statistics of diff_idx, diff_std and diff_vol
# Note: merged_data has no 'diff_idx' column at this point, which raises the KeyError shown below
ic_mean_idx, ic_std_idx, ic_ir_idx = calc_ic_stats(merged_data, 'diff_idx')
ic_mean_std, ic_std_std, ic_ir_std = calc_ic_stats(merged_data, 'diff_std')
ic_mean_vol, ic_std_vol, ic_ir_vol = calc_ic_stats(merged_data, 'diff_vol')

# Print the statistics
print(f"diff_idx IC mean: {ic_mean_idx}, IC std: {ic_std_idx}, ICIR: {ic_ir_idx}")
print(f"diff_std IC mean: {ic_mean_std}, IC std: {ic_std_std}, ICIR: {ic_ir_std}")
print(f"diff_vol IC mean: {ic_mean_vol}, IC std: {ic_std_vol}, ICIR: {ic_ir_vol}")

KeyError: 'diff_idx'

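Step 3: Long-short portfolio test

Build a long-short portfolio for each factor and compute its long-short return and related performance metrics.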

In [98]:
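def long_short_strategy(merged_data, factor_name, quantile=0.2):
    # Rank all rows by the factor and split off the top and bottom quantiles
    # Note: this adds a 'rank' column to the DataFrame that is passed in
    merged_data['rank'] = merged_data[factor_name].rank()
    long = merged_data[merged_data['rank'] >= merged_data['rank'].quantile(1 - quantile)]
    short = merged_data[merged_data['rank'] <= merged_data['rank'].quantile(quantile)]
    
    # Average forward return of the long and short legs
    long_return = long['future_return'].mean()
    short_return = short['future_return'].mean()
    
    # Annualized long-short return and annualized volatility
    # Note: the factor of 252 assumes future_return is a daily return
    annual_return = (long_return - short_return) * 252  # assumes 252 trading days per year
    # Note: this is the volatility of all forward returns, not of the long-short return series
    volatility = merged_data['future_return'].std() * np.sqrt(252)
    
    return annual_return, volatility

# Compute the long-short return and volatility of diff_idx, diff_std and diff_vol
# Note: merged_data has no 'diff_idx' column at this point, which raises the KeyError shown below
annual_return_idx, vol_idx = long_short_strategy(merged_data, 'diff_idx')
annual_return_std, vol_std = long_short_strategy(merged_data, 'diff_std')
annual_return_vol, vol_vol = long_short_strategy(merged_data, 'diff_vol')

# Print the results
print(f"diff_idx annualized return: {annual_return_idx}, annualized volatility: {vol_idx}")
print(f"diff_std annualized return: {annual_return_std}, annualized volatility: {vol_std}")
print(f"diff_vol annualized return: {annual_return_vol}, annualized volatility: {vol_vol}")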

KeyError: 'diff_idx'
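The metrics listed for the long-short test (annualized volatility, Sharpe ratio, maximum drawdown) need a time series of long-short returns, whereas `long_short_strategy` pools all rows into a single number. The sketch below builds one long-short return per period on simulated data and derives the metrics from that series. The helper names, the zero risk-free rate and the daily rebalancing with 252 periods per year are all assumptions of this example, not the report's backtest rules.

```python
import numpy as np
import pandas as pd

def long_short_series(df, factor_name, ret_name='future_return',
                      period_col='date', quantile=0.2):
    """Per-period return of the top quantile minus the bottom quantile (hypothetical helper)."""
    out = {}
    for period, g in df.groupby(period_col):
        g = g.dropna(subset=[factor_name, ret_name])
        if len(g) < 2:
            continue  # not enough stocks to form both legs
        pct = g[factor_name].rank(pct=True)  # cross-sectional percentile rank
        long_ret = g.loc[pct > 1 - quantile, ret_name].mean()
        short_ret = g.loc[pct <= quantile, ret_name].mean()
        out[period] = long_ret - short_ret
    return pd.Series(out).dropna()

def performance(ls, periods_per_year=252):
    """Annualized return, volatility, Sharpe (zero risk-free rate) and max drawdown."""
    ann_ret = ls.mean() * periods_per_year
    ann_vol = ls.std() * np.sqrt(periods_per_year)
    nav = (1 + ls).cumprod()                 # compounded net asset value
    max_dd = (nav / nav.cummax() - 1).min()  # most negative peak-to-trough move
    return {'annual_return': ann_ret, 'annual_vol': ann_vol,
            'sharpe': ann_ret / ann_vol if ann_vol > 0 else np.nan,
            'max_drawdown': max_dd}

# Simulated panel: 50 days x 20 stocks, returns weakly driven by the factor
rng = np.random.default_rng(0)
days = pd.date_range('2023-01-02', periods=50, freq='B')
toy = pd.DataFrame({
    'date': np.repeat(days, 20),
    'stkcd': np.tile([f'S{i:02d}' for i in range(20)], 50),
    'factor': rng.normal(size=1000),
})
toy['future_return'] = 0.01 * toy['factor'] + rng.normal(scale=0.01, size=1000)

ls = long_short_series(toy, 'factor')
stats = performance(ls)
print(stats)
```

Ranking within each period keeps the long and short legs cross-sectional, so a stock is compared only with other stocks on the same date.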

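Summary:

	•	The IC measures a factor's predictive power; the larger the ICIR in absolute value, the more stable that predictive power is.
	•	The long-short portfolio test builds a long-short portfolio from each factor and evaluates its return, volatility, and related metrics.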

3. Building the Self-Salience Factor

Based on salience theory, build stock-specific factors from each stock's own intraday trading features.

In [44]:
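# Salience analysis: compute volatility and volume features for each group
# Note: this dataset names the stock column 'stkcd', so grouping by 'symbol' raises the KeyError shown below
def calc_intraday_volatility(df):
    return np.std(df['close'].pct_change())

data['intraday_volatility'] = data.groupby('symbol').apply(calc_intraday_volatility)

# Build the salience factor
def calc_significance(df):
    vol = df['intraday_volatility'].mean()
    volume = df['volume'].mean()
    return vol * volume  # simplified salience formula; the actual algorithm is in the report

data['significance'] = data.groupby('symbol').apply(calc_significance)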

KeyError: 'symbol'
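Beyond the column name, assigning the result of `groupby(...).apply(...)` directly to a column would not align, because the result is indexed by group rather than by row. A sketch of the same simplified calculation that avoids both problems follows. It groups by `stkcd` and trading day and keeps the results in a separate per-day table. The volatility-times-volume product is the notebook's placeholder, not the report's salience formula, and the per-day grouping is an assumption of this example.

```python
import pandas as pd

# Toy minute bars: stock A moves during the day, stock B is flat
toy = pd.DataFrame({
    'stkcd': ['A'] * 4 + ['B'] * 4,
    'date': pd.to_datetime(['2022-12-01 09:31', '2022-12-01 09:32',
                            '2022-12-01 09:33', '2022-12-01 09:34'] * 2),
    'close': [10.0, 10.1, 10.0, 10.2, 20.0, 20.0, 20.0, 20.0],
    'volume': [100.0, 200.0, 300.0, 400.0, 50.0, 50.0, 50.0, 50.0],
})
toy = toy.sort_values(['stkcd', 'date'])
day = toy['date'].dt.normalize().rename('day')

# Minute returns computed within each stock-day, so they never span two days
toy['ret'] = toy.groupby(['stkcd', day])['close'].pct_change()

# One row per stock-day: return volatility and average volume
daily = toy.groupby(['stkcd', day]).agg(
    intraday_volatility=('ret', 'std'),
    mean_volume=('volume', 'mean'),
).reset_index()

# Placeholder factor from the notebook: volatility times average volume
daily['significance'] = daily['intraday_volatility'] * daily['mean_volume']
print(daily)
```

Note that pandas `std` uses the sample standard deviation (`ddof=1`), while `np.std` in the original cell defaults to the population version (`ddof=0`); the two differ by a constant factor for groups of equal size.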

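4. Cross-Stock Similarity and the "Peer" Salience Factor

Use intraday price and volume data to compute pairwise similarity between stocks, which characterizes the "peer" salience factor.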

In [None]:
# Build the pairwise similarity matrix between stocks
# The stock identifier column in this dataset is 'stkcd' (not 'symbol').
# Align closes on the bar timestamp so every pair is compared over the same bars;
# pivot_table keeps the last value if a (date, stkcd) pair appears more than once
close_wide = (data.assign(close=data['close'].astype(float))
                  .pivot_table(index='date', columns='stkcd', values='close', aggfunc='last'))

# Pairwise Pearson correlation of closes; bars missing for either stock are skipped per pair
similarity_matrix = close_wide.corr()

# Build the "peer" salience factor

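5. Factor Backtest and Performance Evaluation

Backtest the constructed factors according to the backtest rules in the report, and compute metrics such as IC, Sharpe ratio, and annualized return.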

In [None]:
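# Backtest the factor
def backtest_factor(df, factor_column):
    # Example strategy: hold the rows whose previous-bar factor value ranks in the top 9 overall
    # The stock identifier column in this dataset is 'stkcd' (not 'symbol')
    long_positions = df.groupby('stkcd')[factor_column].shift(1).rank(ascending=False) < 10
    # Next-bar return, computed within each stock so it never spans two stocks
    returns = df.groupby('stkcd')['close'].transform(lambda s: s.astype(float).pct_change().shift(-1))
    # Average return over the held positions only
    strategy_returns = returns[long_positions].mean()
    return strategy_returns

# Test the factor
factor_performance = backtest_factor(data, 'significance')
print(f"Factor performance: {factor_performance}")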

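6. Risk Analysis and Warnings

Consider how changes in the market environment affect factor performance, and include the corresponding risk warnings.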

In [None]:
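# Analyze the model's performance in different market environments
def analyze_risk(df, factor_column):
    # For example, compare factor performance in bull and bear markets
    # (assumes df already has a 'market_trend' column labelled 'bull' / 'bear')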
    bull_market = df['market_trend'] == 'bull'
    bear_market = df['market_trend'] == 'bear'
    
    bull_performance = backtest_factor(df[bull_market], factor_column)
    bear_performance = backtest_factor(df[bear_market], factor_column)
    
    print(f"Bull market performance: {bull_performance}, Bear market performance: {bear_performance}")

analyze_risk(data, 'significance')