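## Assignment requirements

1. Optimize the target mean algorithm.
2. Speed it up with Cython.
3. Add parallelism.
4. Bonus task: this one closely mirrors real work requirements, and higher scorers will be given priority for an internal referral.
    * Read an introduction to B-splines.
    * Use Cython to implement an operation that takes multiple input columns and returns their B-spline basis.
    * Note: recursive function calls are not allowed.
    * Note: exceptional inputs (e.g. missing values, inf) must be handled.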
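
For the bonus task, a minimal NumPy sketch of a non-recursive B-spline basis evaluation may be a useful starting point before porting the loops to Cython. The function names `bspline_basis` and `bspline_basis_columns`, the single shared `knots` vector, and the choice to return NaN rows for non-finite inputs are all assumptions made for illustration; the assignment itself does not prescribe them.

```python
import numpy as np


def bspline_basis(x, knots, degree=3):
    """Evaluate every B-spline basis function at each point of a 1-D array.

    Uses the Cox-de Boor recurrence as a bottom-up loop (no recursive calls).
    Rows for NaN or +/-inf inputs are filled with NaN (an assumed policy);
    finite points outside the knot range get an all-zero row.
    """
    x = np.asarray(x, dtype=float).ravel()
    t = np.asarray(knots, dtype=float)
    if not np.all(np.isfinite(t)) or np.any(np.diff(t) < 0):
        raise ValueError("knots must be finite and non-decreasing")
    n_basis = t.shape[0] - degree - 1
    if degree < 0 or n_basis < 1:
        raise ValueError("need at least degree + 2 knots")

    out = np.full((x.shape[0], n_basis), np.nan)
    ok = np.isfinite(x)          # mask out missing values and infinities
    xv = x[ok]

    # Degree 0: indicator of each half-open knot interval [t[j], t[j+1]).
    B = np.zeros((xv.shape[0], t.shape[0] - 1))
    for j in range(t.shape[0] - 1):
        B[:, j] = (t[j] <= xv) & (xv < t[j + 1])
    # Close the last non-empty interval so x == t[-1] is not dropped.
    nonempty = np.nonzero(np.diff(t) > 0)[0]
    if nonempty.size:
        B[xv == t[-1], nonempty[-1]] = 1.0

    # Raise the degree one step at a time; 0/0 terms are skipped (treated as 0).
    for d in range(1, degree + 1):
        nxt = np.zeros((xv.shape[0], t.shape[0] - 1 - d))
        for j in range(nxt.shape[1]):
            left_den = t[j + d] - t[j]
            right_den = t[j + d + 1] - t[j + 1]
            if left_den > 0:
                nxt[:, j] += (xv - t[j]) / left_den * B[:, j]
            if right_den > 0:
                nxt[:, j] += (t[j + d + 1] - xv) / right_den * B[:, j + 1]
        B = nxt

    out[ok] = B
    return out


def bspline_basis_columns(X, knots, degree=3):
    """Apply bspline_basis to each column of X and stack the results side by side."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    return np.hstack([bspline_basis(X[:, c], knots, degree) for c in range(X.shape[1])])
```

With a clamped knot vector such as `[0, 0, 0, 0, 1, 2, 3, 3, 3, 3]` and `degree=3`, each input column yields six basis columns, and every row for a finite input inside `[0, 3]` sums to 1. The two nested loops use only indexed array access, which is the shape of code that typically maps well onto typed Cython loops.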

### Random data

In [647]:
import numpy as np
import pandas as pd
import time

In [648]:
y = np.random.randint(2, size=(5000, 1))
x = np.random.randint(10, size=(5000, 1))
data = pd.DataFrame(np.concatenate([y, x], axis=1), columns=['y', 'x'])

### Example algorithm

In [649]:
def target_mean_v1(data, y_name, x_name):
    result = np.zeros(data.shape[0])
    for i in range(data.shape[0]):
        # Group every row except row i by x, then read the mean of y for row i's category.
        groupby_result = data[data.index != i].groupby(x_name).agg(['mean', 'count'])
        result[i] = groupby_result.loc[data.loc[i, x_name], (y_name, 'mean')]
    return result

In [650]:
%%time
v1_ans = target_mean_v1(data, 'y', 'x')

CPU times: user 38.8 s, sys: 9.98 ms, total: 38.8 s
Wall time: 52.6 s
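
The loop above recomputes a full groupby for every row, which is why it is slow. The faster versions in the following sections rest on one identity, written here as a sketch. The symbols $S_g$ (sum of `y` in category $g$) and $N_g$ (number of rows in category $g$) are notation introduced for this note, not names from the code:

$$
\text{result}_i = \frac{S_{x_i} - y_i}{N_{x_i} - 1},
\qquad
S_g = \sum_{j:\, x_j = g} y_j,
\qquad
N_g = \left|\{\, j : x_j = g \,\}\right|
$$

Since $S_g$ and $N_g$ only need to be computed once per category, each row's value follows from a subtraction and a division. The expression is undefined when a category contains a single row ($N_{x_i} = 1$).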


### Python optimization

In [651]:
def target_mean_python(data, y_name, x_name):
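    # Per-category sum and count of the target, indexed by category value.
    temp = data.groupby(x_name)[y_name].agg(['sum', 'count'])
    # Leave-one-out mean: drop the row's own target from its category's sum and count.
    compute = lambda y, x: (temp.at[x, 'sum'] - y) / (temp.at[x, 'count'] - 1)
    # Precompute the answer for every (target, category) pair, then look each row up.
    data_dict = {(y, x): compute(y, x) for y in data[y_name].unique() for x in temp.index}
    result = np.array([data_dict[(row[y_name], row[x_name])] for _, row in data.iterrows()])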
    return result

In [652]:
%%time
python_ans = target_mean_python(data, 'y', 'x')
assert np.allclose(v1_ans, python_ans)
print('pass')

pass
CPU times: user 513 ms, sys: 0 ns, total: 513 ms
Wall time: 674 ms


### Pandas optimization

In [653]:
def target_mean_pandas(data, y_name, x_name):
    temp = data.groupby(x_name).agg(['sum', 'count']).droplevel(0, axis=1).reset_index()
    # Stack one copy of the per-category stats for each distinct target value.
    df_target_mean = pd.concat(
        [temp.assign(**{y_name: target}) for target in data[y_name].unique()], axis=0)
    df_target_mean['result'] = (df_target_mean['sum'] - df_target_mean[y_name]) / (df_target_mean['count'] - 1)
    result = pd.merge(data, df_target_mean, on=[y_name, x_name], how='left')['result'].values
    return result

In [654]:
%%time
pandas_ans = target_mean_pandas(data, 'y', 'x')
assert np.allclose(v1_ans, pandas_ans)
print('pass')

pass
CPU times: user 25.9 ms, sys: 982 µs, total: 26.9 ms
Wall time: 57.9 ms


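### Cython optimization

1. Use a for loop to compute the per-category sum and count.
2. Use a for loop to compute the result for every sample.
3. Optimize both of the steps above.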
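
The first two steps in this list can be sketched in plain Python with NumPy arrays before any Cython typing is added. The function name `target_mean_two_pass`, the `n_categories` parameter, and the assumption that `x` already holds integer codes in `[0, n_categories)` are illustrative choices, not part of the original notebook.

```python
import numpy as np


def target_mean_two_pass(y, x, n_categories):
    """Leave-one-out target mean using two explicit loops.

    Assumes x contains integer category codes in [0, n_categories).
    Rows whose category has no other member get NaN.
    """
    y = np.asarray(y, dtype=np.float64)
    x = np.asarray(x, dtype=np.int64)
    n = y.shape[0]

    sums = np.zeros(n_categories, dtype=np.float64)
    counts = np.zeros(n_categories, dtype=np.int64)

    # Pass 1: accumulate the sum and count of y for each category.
    for i in range(n):
        sums[x[i]] += y[i]
        counts[x[i]] += 1

    # Pass 2: remove each row's own contribution from its category's totals.
    result = np.empty(n, dtype=np.float64)
    for i in range(n):
        others = counts[x[i]] - 1
        if others > 0:
            result[i] = (sums[x[i]] - y[i]) / others
        else:
            result[i] = np.nan
    return result
```

For the notebook's data this would be called as `target_mean_two_pass(data['y'].values, data['x'].values, 10)`. In a Cython port, the arrays would usually become typed memoryviews and the loop indices C integers. The iterations of the second loop are independent of one another, so that loop is a natural candidate for `cython.parallel.prange`, whereas the first loop writes to shared accumulators and would need per-thread buffers to be parallelized safely.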