
## Assignment requirements

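1. Optimize the target mean algorithm
    * Speed it up with Cython
    * Add parallelism
2. Use Cython to implement a function that returns the B-spline basis for multiple input columns.
    * See this introduction to B-splines: https://www.cs.unc.edu/~dm/UNC/COMP258/Papers/bsplbasic.pdf
    * Recursive function calls are not allowed.
    * Abnormal inputs (for example missing values, inf) must be handled.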
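The second requirement can be illustrated with a minimal pure-Python sketch of the Cox-de Boor recurrence written as nested loops instead of recursive calls. It is not the assignment's Cython solution: the names `bspline_basis` and `bspline_basis_columns`, the dict-of-lists input, and the choice to return NaN for missing, infinite, or out-of-range values are all assumptions made for illustration.

```python
import math


def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of `degree` at x without recursion."""
    n_basis = len(knots) - degree - 1
    if n_basis < 1:
        raise ValueError("need at least degree + 2 knots")
    # Assumption: missing, infinite, or out-of-range inputs map to a row of NaN
    if x is None or not math.isfinite(x) or x < knots[0] or x > knots[-1]:
        return [float("nan")] * n_basis

    # Degree 0: indicator of the half-open knot span [t_i, t_{i+1}) containing x
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    if x == knots[-1]:
        # Close the last non-empty span on the right so the endpoint is covered
        for i in range(len(knots) - 2, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break

    # Raise the degree one step at a time; terms with a zero denominator count as 0
    for d in range(1, degree + 1):
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            basis[i] = left + right  # basis[i + 1] still holds the degree d-1 value
    return basis[:n_basis]


def bspline_basis_columns(columns, knots, degree):
    """Apply bspline_basis to every value of every column (dict: name -> list)."""
    return {name: [bspline_basis(v, knots, degree) for v in values]
            for name, values in columns.items()}
```

Because the inner loops only index into flat sequences, the same structure should carry over to Cython with typed variables and memoryviews, although that port is not shown here.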


### Random data

In [36]:
import numpy as np
import pandas as pd
import time

In [37]:
y = np.random.randint(2, size=(5000, 1))
x = np.random.randint(10, size=(5000, 1))
data = pd.DataFrame(np.concatenate([y, x], axis=1), columns=['y', 'x'])

In [38]:
data

Unnamed: 0,y,x
0,1,8
1,1,8
2,0,3
3,1,6
4,1,9
...,...,...
4995,1,9
4996,1,9
4997,1,3
4998,0,6


### Example algorithm

In [39]:
def target_mean_v1(data, y_name, x_name):
    # Baseline: for every row, recompute the per-group statistics on all other rows
    result = np.zeros(data.shape[0])
    for i in range(data.shape[0]):
        groupby_result = data[data.index != i].groupby([x_name]).agg(['mean', 'count'])
        # Pick the mean of y for the group that row i belongs to, as a scalar
        result[i] = groupby_result.loc[groupby_result.index == data.loc[i, x_name], (y_name, 'mean')].iloc[0]
    return result

In [40]:
%%time
v1_ans = target_mean_v1(data, 'y', 'x')

CPU times: user 25.4 s, sys: 9.52 ms, total: 25.4 s
Wall time: 25.6 s
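The baseline repeats a full groupby for every row, yet removing row $i$ only changes the statistics of the group that row belongs to. Writing $S_g$ for the sum of `y` over rows whose `x` equals $g$ and $N_g$ for the number of such rows (symbols introduced here for illustration, not taken from the notebook), the same value has a closed form:

$$
\mathrm{TM}_i = \frac{S_{x_i} - y_i}{N_{x_i} - 1},
\qquad S_g = \sum_{j:\, x_j = g} y_j,
\qquad N_g = \bigl|\{\, j : x_j = g \,\}\bigr|
$$

One pass over the data yields every $S_g$ and $N_g$, and a second pass yields every $\mathrm{TM}_i$. The expression is undefined when $N_{x_i} = 1$, that is, when a row is the only member of its group.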


### Python optimization

In [41]:
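def target_mean_python(data, y_name, x_name):
    # Sum and count of y per x group, computed once and indexed by the x value itself
    temp = data.groupby(x_name)[y_name].agg(['sum', 'count'])
    # Leave-one-out mean for a row with target y in group x
    compute = lambda y, x: (temp.at[x, 'sum'] - y) / (temp.at[x, 'count'] - 1)
    data_dict = {(y, x): compute(y, x) for y in data[y_name].unique() for x in temp.index}
    # Select the columns explicitly so the unpacking does not depend on column order
    result = np.array([data_dict[(y, x)] for _, (y, x) in data[[y_name, x_name]].iterrows()])
    return result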

In [42]:
%%time
python_ans = target_mean_python(data, 'y', 'x')
assert (v1_ans == python_ans).all()
print('pass')

pass
CPU times: user 374 ms, sys: 2 ms, total: 376 ms
Wall time: 378 ms


### Pandas optimization

In [43]:
def target_mean_pandas(data, y_name, x_name):
    # Sum and count of y per x group
    temp = data.groupby(x_name)[y_name].agg(['sum', 'count']).reset_index()
    # One copy of the group statistics per distinct target value
    df_target_mean = pd.concat([temp.assign(**{y_name: target}) for target in data[y_name].unique()], axis=0)
    # Leave-one-out mean: remove the row's own target from its group totals
    df_target_mean['result'] = (df_target_mean['sum'] - df_target_mean[y_name]) / (df_target_mean['count'] - 1)
    result = pd.merge(data, df_target_mean, on=[y_name, x_name], how='left')['result'].values
    return result

In [44]:
%%time
pandas_ans = target_mean_pandas(data, 'y', 'x')
assert (v1_ans == pandas_ans).all()
print('pass')

pass
CPU times: user 23 ms, sys: 4 ms, total: 27 ms
Wall time: 27.7 ms


### Cython optimization

In [52]:
def target_mean_cython(data, y_name, x_name):
    temp_dict = {}
    result = np.zeros(shape=data.shape[0])

    # Pass 1: accumulate the sum and count of y for each distinct x value
    for _, row_data in data.iterrows():
        x_value, y_value = row_data[x_name], row_data[y_name]

        if x_value not in temp_dict:
            temp_dict[x_value] = {'sum': 0, 'count': 0}

        temp_dict[x_value]['sum'] += y_value
        temp_dict[x_value]['count'] += 1

    # Pass 2: leave-one-out mean, removing each row's own y from its group totals
    for i, (_, row_data) in enumerate(data.iterrows()):
        x_value, y_value = row_data[x_name], row_data[y_name]
        result[i] = (temp_dict[x_value]['sum'] - y_value) / (temp_dict[x_value]['count'] - 1)

    return result

In [53]:
%%time
cython_ans = target_mean_cython(data, 'y', 'x')
assert (v1_ans == cython_ans).all()
print('pass')

pass
CPU times: user 826 ms, sys: 971 µs, total: 827 ms
Wall time: 835 ms
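The cell above is still plain Python (it contains no `%%cython` magic and no typed declarations), and most of its time is likely spent in `iterrows`. As a rough sketch of how the two-pass structure could be split into independent chunks, which is the shape a parallel version would need, the example below works on plain sequences. The names `target_mean_chunked` and `_partial_stats`, the `n_chunks` parameter, and returning NaN for single-row groups are assumptions made here. Threads in CPython will not speed up these pure-Python loops; in Cython the same chunking would more plausibly be expressed with `cython.parallel.prange` under `nogil`.

```python
from concurrent.futures import ThreadPoolExecutor


def _partial_stats(xs, ys):
    """Per-group sum and count of ys for one chunk."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[x] = sums.get(x, 0) + y
        counts[x] = counts.get(x, 0) + 1
    return sums, counts


def target_mean_chunked(xs, ys, n_chunks=4):
    """Leave-one-out target mean over plain sequences, processed in chunks."""
    n = len(xs)
    size = max(1, -(-n // n_chunks))  # ceiling division
    bounds = [(start, min(start + size, n)) for start in range(0, n, size)]

    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Pass 1: per-chunk statistics, merged into global totals afterwards
        partials = list(pool.map(
            lambda b: _partial_stats(xs[b[0]:b[1]], ys[b[0]:b[1]]), bounds))
        sums, counts = {}, {}
        for part_sums, part_counts in partials:
            for key, value in part_sums.items():
                sums[key] = sums.get(key, 0) + value
                counts[key] = counts.get(key, 0) + part_counts[key]

        # Pass 2: each chunk only reads the merged totals, so chunks are independent
        def leave_one_out(b):
            return [
                (sums[x] - y) / (counts[x] - 1) if counts[x] > 1 else float("nan")
                for x, y in zip(xs[b[0]:b[1]], ys[b[0]:b[1]])
            ]

        chunks = list(pool.map(leave_one_out, bounds))

    # Chunks come back in input order, so flattening restores the row order
    return [value for chunk in chunks for value in chunk]
```

With the notebook's DataFrame this could be called as `target_mean_chunked(data['x'].tolist(), data['y'].tolist())` and the result compared against `v1_ans` after wrapping it in `np.array`.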
